Raw notes from the Storage Developer Conference 2015 (SNIA SDC 2015)

Notes and disclaimers:

  • This blog post contains raw notes for some of the SNIA SDC 2015 presentations (SNIA’s Storage Developer Conference 2015)
  • These notes were typed during the talks and they may include typos and my own misinterpretations.
  • The text in the bullets under each talk is quoted from the speaker or the speaker’s slides, not my personal opinion.
  • If you feel that I misquoted you or badly represented the content of a talk, please add a comment to the post.
  • I spent limited time fixing typos or correcting the text after the event. There are only so many hours in a day...
  • I have not attended all sessions (since there are many being delivered at a time, that would actually not be possible :-)…
  • SNIA usually posts the actual PDF decks a few weeks after the event. Attendees have access immediately.
  • You can find the event agenda at http://www.snia.org/events/storage-developer/agenda


Understanding the Intel/Micron 3D XPoint Memory
Jim Handy, General Director, Objective Analysis

  • Memory analyst, SSD analyst, blogs: http://thememoryguy.com, http://thessdguy.com
  • Not much information available since the announcement in July: http://newsroom.intel.com/docs/DOC-6713
  • Agenda: What? Why? Who? Is the world ready for it? Should I care? When?
  • What: Picture of the 3D XPoint concept (pronounced 3d-cross-point). Micron’s photograph of “the real thing”.
  • Intel has researched PCM for 45 years. Mentioned in an Intel article in “Electronics” on Sep 28, 1970.
  • The many elements that have been tried shown in the periodic table of elements.
  • NAND laid the path to the increased hierarchy levels. Showed prices of DRAM/NAND from 2001 to 2015. Gap is now 20x.
  • Comparing bandwidth to price per gigabytes for different storage technologies: Tape, HDD, SSD, 3D XPoint, DRAM, L3, L2, L1
  • Intel diagram mentions PCM-based DIMMs (far memory) and DDR DIMMs (near memory).
  • Chart with latency for HDD SAS/SATA, SSD SAS/SATA, SSD NVMe, 3D XPoint NVMe – how much of it is the media, how much is the software stack?
  • 3D XPoint’s place in the memory/storage hierarchy. IOPS x Access time. DRAM, 3D XPoint (Optane), NVMe SSD, SATA SSD
  • Great gains at low queue depth. 800GB SSD read IOPS using 16GB die. IOPS x queue depth of NAND vs. 3D XPoint.
  • Economic benefits: measuring $/write IOPS for SAS HDD, SATA SSD, PCIe SSD, 3D XPoint
  • Timing is good because: DRAM is running out of speed, NVDIMMs are catching on, some sysadmins understand how to use flash to reduce DRAM needs
  • Timing is bad because: Nobody can make it economically, no software supports SCM (storage class memory), new layers take time to establish
  • Why should I care: better cost/perf ratio, lower power consumption (less DRAM, more perf/server, lower OpEx), in-memory DB starts to make sense
  • When? Micron slide projects 3D XPoint at end of FY17 (two months ahead of CY). Same slide shows NAND production surpassing DRAM production in FY17.
  • Comparing average price per GB compared to the number of GB shipped over time. It takes a lot of shipments to lower price.
  • Looking at the impact in the DRAM industry if this actually happens. DRAM slows down dramatically starting in FY17, as 3D XPoint revenues increase (optimistic).


Next Generation Data Centers: Hyperconverged Architectures Impact On Storage
Mark OConnell, Distinguished Engineer, EMC

  • History: Client/Server –> shared SANs –> Scale-Out systems
  • >> Scale-Out systems: architecture, expansion, balancing
  • >> Evolution of the application platform: physical servers → virtualization → virtualized application farm
  • >> Virtualized application farms and Storage: local storage → Shared Storage (SAN) → Scale-Out Storage → Hyper-converged
  • >> Early hyper-converged systems: HDFS (Hadoop) → JVM/Tasks/HDFS in every node
  • Effects of hyper-converged systems
  • >> Elasticity (compute/storage density varies)
  • >> App management, containers, app frameworks
  • >> Storage provisioning: frameworks (openstack swift/cinder/manila), pure service architectures
  • >> Hybrid cloud enablement. Apps as self-describing bundles. Storage as a dynamically bound service. Enables movement off-prem.


Implications of Emerging Storage Technologies on Massive Scale Simulation Based Visual Effects
Yahya H. Mirza, CEO/CTO, Aclectic Systems Inc

  • Steve Jobs quote: "You’ve got to start with the customer experience and work back toward the technology".
  • Problem 1: Improve customer experience. Higher resolution, frame rate, throughput, etc.
  • Problem 2: Production cost continues to rise.
  • Problem 3: Time to render single frame remains constant.
  • Problem 4: Render farm power and cooling increasing. Coherent shared memory model.
  • How do you reduce customer CapEx/OpEx? Low efficiency: 30% CPU utilization. Problem is memory access latency and I/O.
  • Production workflow: modeling, animation/simulation/shading, lighting, rendering, compositing. More and more simulation.
  • Concrete production experiment: 2005. Story boards. Attempt to create a short film. Putting himself in the customer’s shoes. Shot decomposition.
  • Real 3-minute short costs $2 million. Animatic to pitch the project.
  • Character modeling and development. Includes flesh and muscle simulation. A lot of it done procedurally.
  • Looking at Disney’s “Big Hero 6”, DreamWorks’ “Puss in Boots” and Weta’s “The Hobbit”, including simulation costs, frame rate, resolution, size of files, etc.
  • Physically based rendering: global illumination effects, reflection, shadows. Comes down to light transport simulation, physically based materials description.
  • Exemplary VFX shot pipeline. VFX Tool (Houdini/Maya), Voxelized Geometry (OpenVDB), Scene description (Alembic), Simulation Engine (PhysBam), Simulation Farm (RenderFarm), Simulation Output (OpenVDB), Rendering Engine (Mantra), Render Farm (RenderFarm), Output format (OpenEXR), Compositor (Flame), Long-term storage.
  • One example: smoke simulation – reference model smoke/fire VFX. Complicated physical model. Hotspot algorithms: monte-carlo integration, ray-intersection test, linear algebra solver (multigrid).
  • Storage implications. Compute storage (scene data, simulation data), Long term storage.
  • Is public cloud computing viable for high-end VFX?
  • Disney’s data center. 55K cores across 4 geos.
  • Vertically integrated systems are going to be more and more important. FPGAs, ARM-based servers.
  • Aclectic Colossus smoke demo. Showing 256x256x256.
  • We don’t want coherency; we don’t want sharing. Excited about Intel OmniPath.
  • http://www.intel.com/content/www/us/en/high-performance-computing-fabrics/omni-path-architecture-fabric-overview.html
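The smoke-simulation hotspots named above (Monte Carlo integration, ray-intersection tests, multigrid solvers) are standard numerical kernels. As a purely illustrative sketch of the first one (not from the talk), here is a minimal Monte Carlo estimator for a 1-D integral in Python:

```python
import random

def mc_integrate(f, a, b, n, seed=0):
    """Estimate the integral of f over [a, b] with n uniform samples."""
    rng = random.Random(seed)
    total = sum(f(a + (b - a) * rng.random()) for _ in range(n))
    return (b - a) * total / n

# Example: the integral of x^2 over [0, 1] is exactly 1/3.
estimate = mc_integrate(lambda x: x * x, 0.0, 1.0, 100_000)
```

In production renderers this kernel runs per light path per pixel, which is why memory latency and I/O, not raw FLOPS, dominate the cost profile the speaker describes.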


How Did Human Cells Build a Storage Engine?
Sanjay Joshi, CTO Life Sciences, EMC

  • Human cell, Nuclear DNA, Transcription and Translation, DNA Structure
  • The data structure: [char(3*10^9) human_genome] strand
  • 3 gigabases [(3*10^9)*2]/8 = ~750MB. With overlaps, ~1GB per cell. 15-70 trillion cells.
  • Actual files used to store genome are bigger, between 10GB and 4TB (includes lots of redundancy).
  • Genome sequencing will surpass all other data types by 2040
  • Protein coding portion is just a small portion of it. There’s a lot we don’t understand.
  • Nuclear DNA: Is it a file? Flat file system, distributed, asynchronous. Search header, interpret, compile, execute.
  • Nuclear DNA properties: Large:~20K genes/cell, Dynamic: append/overwrite/truncate, Semantics: strict, Consistent: No, Metadata: fixed, View: one-to-many
  • Mitochondrial DNA: Object? Distributed hash table, a ring with 32 partitions. Constant across generations.
  • Mitochondrial DNA: Small: ~40 genes/cell, Static: constancy, energy functions, Semantics: single origin, Consistent: Yes, Metadata: system based, View: one-to-one
  • File versus object. Comparing Nuclear DNA and Mitochondrial DNA characteristics.
  • The human body: 7,500 named parts, 206 regularly occurring bones (newborns close to 300), ~640 skeletal muscles (320 pairs), 60+ organs, 37 trillion cells. Distributed cluster.
  • Mapping the ISO 7 layers to this system. Picture.
  • Finite state machine: max 10^45 states at 4*10^53 state-changes/sec. 10^24 NOPS (nucleotide ops per second) across biosphere.
  • Consensus in cell biology: Safety: under all conditions: apoptosis. Availability: billions of replicate copies. Not timing dependent: asynchronous. Command completion: 10 base errors in every 10,000 protein translation (10 AA/sec).
  • Object vs. file. Object: Maternal, Static, Haploid. Small, Simple, Energy, Early. File: Maternal and paternal, Diploid. Scalable, Dynamic, Complex. All cells are female first.
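The back-of-the-envelope sizing above (2 bits per nucleotide) can be checked in a couple of lines of Python; everything beyond the talk's 3-gigabase figure is my own arithmetic:

```python
def genome_bytes(bases, bits_per_base=2):
    # A, C, G, T -> 2 bits per nucleotide, 8 bits per byte
    return bases * bits_per_base // 8

# 3 gigabases * 2 bits / 8 = 750,000,000 bytes, i.e. ~750MB,
# matching the talk's [(3*10^9)*2]/8 figure.
haploid = genome_bytes(3 * 10**9)
```

The ~1GB-per-cell figure then follows from adding overlaps and packaging overhead, and the 10GB–4TB file sizes from the heavy redundancy in sequencer output formats.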


Move Objects to LTFS Tape Using HTTP Web Service Interface
Matt Starr, Chief Technical Officer, Spectra Logic
Jeff Braunstein, Developer Evangelist, Spectra Logic

  • Worldwide data growth: 2009 = 800 EB, 2015 = 6.5ZB, 2020 = 35ZB
  • Genomics. 6 cows = 1TB of data. They keep it forever.
  • Video data. SD to Full HD to 4K UHD (4.2TB per hour) to 8K UHD. Also kept forever.
  • Intel slide on the Internet minute. 90% of the people of the world never took a picture with anything but a camera phone.
  • IOT - Total digital info create or replicated.
  • $1000 genome scan takes 780MB fully compressed. 2011 HiSeq-2000 scanner generates 20TB per month. Typical camera generates 105GB/day.
  • More and more examples.
  • Tape storage is the lowest cost. But it’s also complex to deploy. Comparing to Public and Private cloud…
  • Pitfalls of public cloud – chart of $/PB/day. OpEx per PB/day reaches very high for public cloud.
  • Risk of public cloud: Amazon has 1 trillion objects. If they lost 1%, that would be 10 billion objects.
  • Risk of public cloud: Nirvanix. VC pulled the plug in September 2013.
  • Cloud: Good: toolkits, naturally WAN friendly, user expectation: put it away.
  • What if: Combine S3/Object with tape. Spectra S3 – Front end is REST, backend is LTFS tape.
  • Cost: $0.09/GB. 7.2PB. Potentially a $0.20 two-copy archive.
  • Automated: App or user-built. Semi-Automated: NFI or scripting.
  • Information available at https://developer.spectralogic.com
  • All the tools you need to get started. Including simulator of the front end (BlackPearl) in a VM.
  • S3 commands, plus data to write sequentially in bulk fashion.
  • Configure user for access, buckets.
  • Deep storage browser (source code on GitHub) allows you to browse the simulated storage.
  • SDK available in Java, C#, many others. Includes integration with Visual Studio (demonstrated).
  • Showing sample application. 4 lines of code from the SDK to move a folder to tape storage.
  • Q: Access times when not cached? Hours or minutes. Depends on if the tape is already in the drive. You can ask to pull those to cache, set priorities. By default GET has higher priority than PUT. 28TB or 56TB of cache.
  • Q: Can we use CIFS/NFS? Yes, there is an NFI (Network File Interface) using CIFS/NFS, which talks to the cache machine. Manages time-outs.
  • Q: Any protection against this being used as disk? System monitors health of the tape. Using an object-based interface helps.
  • Q: Can you stage a file for some time, like 24h? There is a large cache. But there are no guarantees on the latency. Keeping it on cache is more like Glacier. What’s the trigger to bring the data?
  • Q: Glacier? Considering support for it. Data policy to move to lower cost, move it back (takes time). Not a lot of product or customers demanding it. S3 has become the standard, not sure if Glacier will be that for archive.
  • Q: Drives are a precious resource. How do you handle overload? By default, reads have precedence over writes. Writes usually can wait.
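The "write sequentially in bulk fashion" point can be made concrete with a toy job planner: group submitted objects into ordered chunks under a tape-friendly size limit so each chunk streams sequentially. This is my own sketch, not the Spectra DS3 SDK (whose real API lives at developer.spectralogic.com):

```python
def plan_bulk_put(objects, chunk_limit):
    """Group (name, size) pairs into ordered chunks whose total size
    stays under chunk_limit, preserving submission order so each
    chunk can be streamed to tape sequentially."""
    chunks, current, used = [], [], 0
    for name, size in objects:
        if size > chunk_limit:
            raise ValueError(f"{name} exceeds the chunk limit")
        if used + size > chunk_limit and current:
            chunks.append(current)   # close the current chunk
            current, used = [], 0
        current.append(name)
        used += size
    if current:
        chunks.append(current)
    return chunks

# Sizes in GB, with a hypothetical 100GB chunk limit.
jobs = plan_bulk_put([("a", 40), ("b", 70), ("c", 30), ("d", 50)], 100)
```

The real front end (BlackPearl) additionally stages chunks through its disk cache and reorders work so GETs take priority over PUTs, as covered in the Q&A.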


Taxonomy of Differential Compression
Liwei Ren, Scientific Adviser, Trend Micro

  • Mathematical model for describing file differences
  • Lossless data compression categories: data compression (one file), differential compression (two files), data deduplication (multiple files)
  • Purposes: network data transfer acceleration and storage space reduction
  • Areas for DC – mobile phones’ firmware over the air, incremental update of files for security software, file synchronization and transfer over WAN, executable files
  • Math model – Diff procedure: Delta = T – R, Merge procedure: T = R + Delta. Model for reduced network bandwidth, reduced storage cost.
  • Applications: backup, revision control system, patch management, firmware over the air, malware signature update, file sync and transfer, distributed file system, cloud data migration
  • Diff model. Two operations: COPY (source address, size [, destination address] ), ADD (data block, size [, destination address] )
  • How to create the delta? How to encode the delta into a file? How to create the right sequence of COPY/ADD operations?
  • Top task is an effective algorithm to identify common blocks. Not covering it here, since it would take more than half an hour…
  • Modeling a diff package. Example.
  • How do you measure the efficiency of an algorithm? You need a cost model.
  • Categorizing: Local DC - LDC (xdelta, zdelta, bsdiff), Remote DC - RDC (rsync, RDC protocol, tsync), Iterative – IDC (proposed)
  • Categorizing: Not-in-place merging: general files (xdelta, zdelta, bsdiff), executable files (bsdiff, courgette)
  • Categorizing: In place merging: firmware as general files (FOTA), firmware as executable files (FOTA)
  • Topics in depth: LDC vs RDC vs IDC for general files
  • Topics in depth: LDC for executable files
  • Topics in depth: LDC for in-place merging
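The Delta = T − R / T = R + Delta model with COPY/ADD operations can be sketched in a few lines of Python. This is a naive greedy encoder over fixed-size blocks, only to make the two operations concrete; real tools like xdelta and bsdiff use far stronger common-block discovery (the "top task" the speaker skipped):

```python
def make_delta(ref, target, block=4):
    """Diff procedure: encode target against ref as COPY/ADD ops."""
    index = {ref[i:i + block]: i for i in range(len(ref) - block + 1)}
    ops, i = [], 0
    while i < len(target):
        chunk = target[i:i + block]
        if len(chunk) == block and chunk in index:
            ops.append(("COPY", index[chunk], block))  # source addr, size
            i += block
        else:
            ops.append(("ADD", target[i:i + 1]))       # literal data
            i += 1
    return ops

def merge(ref, delta):
    """Merge procedure: T = R + Delta."""
    out = []
    for op in delta:
        if op[0] == "COPY":
            _, addr, size = op
            out.append(ref[addr:addr + size])
        else:
            out.append(op[1])
    return b"".join(out)

ref = b"the quick brown fox"
target = b"the quick red fox"
delta = make_delta(ref, target)
```

A cost model, as the talk notes, would then compare the encoded size of the COPY/ADD stream against the raw target to measure the algorithm's efficiency.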


New Consistent Hashing Algorithms for Data Storage
Jason Resch, Software Architect, Cleversafe

  • Introducing a new algorithm for hashing.
  • Hashing is useful. Commonly used in distributed storage and distributed caching.
  • Independent users can coordinate (readers know where writers would write without talking to them).
  • Typically, resizing a Hash Table is inefficient. Showing example.
  • That’s why we need “Stable Hashing”. Showing example. Only a small portion of the keys need to be re-mapped.
  • Stable hashing becomes a necessity when the system is stateful and/or transferring state is expensive.
  • Used in Caching/Routing (CARP), DHT/Storage (Gluster, DynamoDB, Cassandra, Ceph, OpenStack)
  • Stable Hashing with Global Namespaces. If you have a file name, you know what node has the data.
  • Eliminates points of contention, no metadata systems. Namespace is fixed, but the system is dynamic.
  • Balances read/write load across nodes, as well as storage utilization across nodes.
  • Perfectly Stable Hashing (Rendezvous Hashing, Consistent Hashing). Precisely weighted (CARP, RUSH, CRUSH).
  • It would be nice to have something that would offer the characteristics of both.
  • Consistent: buckets inserted in random positions. Keys map to the next node greater than that key. With a new node, only neighbors are disrupted. But the neighbor has to send data to the new node, and it might not distribute keys evenly.
  • Rendezvous: Score = Hash (Bucket ID || Key). Bucket with the highest score wins. When adding a new node, some of the keys will move to it. Every node is disrupted evenly.
  • CARP is rendezvous hashing with a twist. It multiplies the scores by a “Load Factor” for each node. Allows for some nodes being more capable than others. Not perfectly stable: if a node’s weighting changes or a node is added, then all load factors must be recomputed.
  • RUSH/CRUSH: Hierarchical tree, with each node assigned a probability to go left/right. CRUSH makes the tree match the fault domains of the system. Efficient to add nodes, but not to remove or re-weight nodes.
  • New algorithm: Weighted Rendezvous Hashing (WRH). Both perfectly stable and precisely weighted.
  • WRH adjusts scores before weighting them. Unlike CARP, scores aren’t relatively scaled.
  • No unnecessary transfer of keys when adding/removing nodes. If adding node or increasing weight on node, other nodes will move keys to it, but nothing else. Transfers are equalized and perfectly efficient.
  • WRH is simple to implement. Whole python code showed in one slide.
  • All the magic is in one line: “Score = 1.0 / -math.log(hash_f)” - Proof of correctness provided for the math inclined.
  • How Cleversafe uses WRH. System is grown by set of devices. Devices have a lifecycle: added, possibly expanded, then retired.
  • Detailed explanation of the lifecycle and how keys move as nodes are added, expanded, retired.
  • Storage Resource Map. Includes weight, hash_seed. Hash seed enables a clever trick to retire device sets more efficiently.
  • Q: How to find data when things are being moved? If clients talk to the old node while keys are being moved. Old node will proxy the request to the new node.
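A minimal Weighted Rendezvous Hashing sketch in Python, reconstructed from the "Score = 1.0 / -math.log(hash_f)" line above. The placement of the weight is my reading of "adjusts scores before weighting them"; the slide's exact code may differ:

```python
import hashlib
import math

def _hash_unit(node_id, key):
    # Map (node, key) to a float in the open interval (0, 1) via SHA-256.
    digest = hashlib.sha256(f"{node_id}|{key}".encode()).hexdigest()
    return (int(digest, 16) + 1) / float(2**256 + 1)

def score(node_id, weight, key):
    # log of a value in (0, 1) is negative, so the score is positive
    # and grows with both the hash draw and the node's weight.
    return -weight / math.log(_hash_unit(node_id, key))

def assign(nodes, key):
    # nodes: {node_id: weight}; the highest-scoring node owns the key.
    return max(nodes, key=lambda n: score(n, nodes[n], key))
```

The "perfectly stable" property falls out directly: adding a node changes a key's assignment only if the new node now has the top score, so keys move onto the new node and nowhere else.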


Storage Class Memory Support in the Windows Operating System
Neal Christiansen, Principal Development Lead, Microsoft

  • Windows support for non-volatile storage medium with RAM-like performance is a big change.
  • Storage Class Memory (SCM): NVDIMM, 3D XPoint, others
  • Microsoft involved with the standardization efforts in this space.
  • New driver model necessary: SCM Bus Driver, SCM Disk Driver.
  • Windows Goals for SCM: Support zero-copy access, run most user-mode apps unmodified, option for 100% backward compatibility (new types of failure modes), sector granular failure modes for app compat.
  • Applications make lots of assumptions on the underlying storage
  • SCM Storage Drivers will support BTT – Block Translation Table. Provides sector-level atomicity for writes.
  • SCM is disruptive. Fastest performance and application compatibility can be conflicting goals.
  • SCM-aware File Systems for Windows. Volume modes: block mode or DAS mode (chosen at format time).
  • Block Mode Volumes – maintain existing semantics, full application compatibility
  • DAS Mode Volumes – introduce new concepts (memory mapped files, maximizes performance). Some existing functionality is lost. Supported by NTFS and ReFS.
  • Memory Mapped IO in DAS mode. Application can create a memory mapped section. Allowed when the volume resides on SCM hardware and has been formatted for DAS mode.
  • Memory Mapped IO: True zero copy access. BTT is not used. No paging reads or paging writes.
  • Cached IO in DAS Mode: Cache manager creates a DAS-enabled cache map. Cache manager will copy directly between user’s buffer and SCM. Coherent with memory-mapped IO. App will see new failure patterns on power loss or system crash. No paging reads or paging writes.
  • Non-cached IO in DAS Mode. Will send IO down the storage stack to the SCM driver. Will use BTT. Maintains existing storage semantics.
  • If you really want the performance, you will need to change your code.
  • DAS mode eliminates traditional hook points used by the file system to implement features.
  • Features not in DAS Mode: NTFS encryption, NTFS compression, NTFS TxF, ReFS integrity streams, ReFS cluster bands, ReFS block cloning, Bitlocker volume encryption, snapshot via VolSnap, mirrored or parity via storage spaces or dynamic disks
  • Sparse files won’t be there initially but will come in the future.
  • Updated at the time the file is memory mapped: file modification time, mark file as modified in the USN journal, directory change notification
  • File System Filters in DAS mode: no notification that a DAS volume is mounted, filter will indicate via a flag if they understand DAS mode semantics.
  • Application compatibility with filters in DAS mode: No opportunity for data transformation filters (encryption, compression). Anti-virus are minimally impacted, but will need to watch for creation of writeable mapped sections (no paging writes anymore).
  • Intel NVML library. Open source library implemented by Intel. Defines a set of application APIs for directly manipulating files on SCM hardware.
  • NVML library available for Linux today via GitHub. Microsoft working with Intel on a Windows port.
  • Q: XIP (Execute in place)? It’s important, but the plans have not solidified yet.
  • Q: NUMA? Can be in NUMA nodes. Typically, the file system and cache are agnostic to NUMA.
  • Q: Hyper-V? Not ready to talk about what we are doing in that area.
  • Q: Roll-out plan? We have one, but not ready to talk about it yet.
  • Q: Data forensics? We’ve yet to discuss this with that group. But we will.
  • Q: How far are you to completion? It’s running and working today. But it is not complete.
  • Q: Windows client? To begin, we’re targeting the server. Because it’s available there first.
  • Q: Effect on performance? When we’re ready to announce the schedule, we will announce the performance. The data about SCM is out there. It’s fast!
  • Q: Will you backport? Probably not. We generally move forward only. Not many systems with this kind of hardware will run a down level OS.
  • Q: What languages for the Windows port of NVML? Andy will cover that in his talk tomorrow.
  • Q: How fast will memory mapped be? Potentially as fast as DRAM, but depends on the underlying technology.
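Memory-mapped access in DAS mode has a rough analogue in ordinary memory-mapped files, which is also the programming model the NVML APIs build on. A hedged Python sketch of the pattern, where a regular file stands in for SCM-backed storage:

```python
import mmap
import os
import tempfile

# Create a small file to play the role of a DAS-mode volume's file.
path = os.path.join(tempfile.mkdtemp(), "scm_demo.bin")
with open(path, "wb") as f:
    f.write(b"\x00" * 4096)

# Map it and store through the mapping: no read()/write() call per
# access, which is the zero-copy idea behind DAS-mode mapped sections.
with open(path, "r+b") as f:
    with mmap.mmap(f.fileno(), 4096) as m:
        m[0:5] = b"hello"
        m.flush()   # loosely analogous to flushing stores to persistence

with open(path, "rb") as f:
    data = f.read(5)
```

On real SCM the mapped stores land directly in persistent media, which is exactly why the traditional hook points (paging writes, data-transform filters) disappear, as the talk describes.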


The Bw-Tree Key-Value Store and Its Applications to Server/Cloud Data Management in Production
Sudipta Sengupta, Principal Research Scientist, Microsoft Research

  • The B-Tree: key-ordered access to records. Balanced tree via page split and merge mechanisms.
  • Design tenets: Lock free operation (high concurrency), log-structure storage (exploit flash devices with fast random reads and inefficient random writes), delta updates to pages (reduce cache invalidation, garbage creation)
  • Bw-Tree Architecture: 3 layers: B-Tree (expose API, B-tree search/update, in-memory pages), Cache (logical page abstraction, move between memory and flash), Flash (reads/writes from/to storage, storage management).
  • Mapping table: Expose logical pages to access method layer. Isolates updates to single page. Structure for lock-free multi-threaded concurrency control.
  • Highly concurrent page updates with Bw-Tree. Explaining the process using a diagram.
  • Bw-Tree Page Split: No hard threshold for splitting unlike in classical B-Tree. B-link structure allows “half-split” without locking.
  • Flash SSDs: Log-Structured storage. Use log structure to exploit the benefits of flash and work around its quirks: random reads are fast, random in-place writes are expensive.
  • LLAMA Log-Structured Store: Amortize cost of writes over many page updates. Random reads to fetch a “logical page”.
  • Depart from tradition: logical page formed by linking together records on multiple physical pages on flash. Adapted from SkimpyStash.
  • Detailed diagram comparing traditional page writing with the writing optimized storage organization with Bw-Tree.
  • LLAMA: Optimized Logical Page Reads. Multiple delta records are packed when flushed together. Pages consolidated periodically in memory also get consolidated on flash when flushed.
  • LLAMA: Garbage collection on flash. Two types of record units in the log: Valid or Orphaned. Garbage collection starts from the oldest portion of the log. Earliest written record on a logical page is encountered first.
  • LLAMA: cache layer. Responsible for moving pages back and forth from storage.
  • Bw-Tree Checkpointing: Need to flush to buffer and to storage. LLAMA checkpoint for fast recovery.
  • Bw-Tree Fast Recovery. Restore mapping table from latest checkpoint region. Warm-up using sequential I/O.
  • Bw-Tree: Support for transactions. Part of the Deuteronomy Architecture.
  • End-to-end crash recovery. Data component (DC) and transactional component (TC) recovery. DC happens before TC.
  • Bw-Tree in production: Key-sequential index in SQL Server in-memory database
  • Bw-Tree in production: Indexing engine in Azure DocumentDB. Resource governance is important (CPU, Memory, IOPS, Storage)
  • Bw-Tree in production: Sorted key-value store in Bing ObjectStore.
  • Summary: Classic B-Tree redesigned for modern hardware and cloud. Lock-free, delta updating of pages, log-structure, flexible resource governor, transactional. Shipping in production.
  • Going forward: Layer transactional component (Deuteronomy Architecture, CIDR 2015), open-source the codebase
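The mapping-table and delta-update ideas can be modeled in a few lines: an update prepends a delta record to a page through a compare-and-swap on the page's mapping-table slot, and consolidation folds the chain back into a base page. A toy single-process sketch (the real Bw-Tree is lock-free native code; this only shows the data flow):

```python
def cas(table, pid, expected, new):
    # Emulates compare-and-swap on a mapping-table slot.
    if table[pid] is expected:
        table[pid] = new
        return True
    return False

def update(table, pid, key, value):
    # Prepend a delta record instead of rewriting the page in place.
    while True:
        old = table[pid]
        if cas(table, pid, old, ("delta", key, value, old)):
            return

def lookup(table, pid, key):
    node = table[pid]
    while node[0] == "delta":
        if node[1] == key:
            return node[2]
        node = node[3]
    return node[1].get(key)           # fell through to the base page

def consolidate(table, pid):
    # Fold the delta chain into a fresh base page (bounds search cost).
    chain, node = [], table[pid]
    while node[0] == "delta":
        chain.append((node[1], node[2]))
        node = node[3]
    entries = dict(node[1])
    for k, v in reversed(chain):      # replay oldest delta first
        entries[k] = v
    table[pid] = ("base", entries)

table = {1: ("base", {"a": 1})}
update(table, 1, "b", 2)
update(table, 1, "a", 3)
```

Because readers traverse whatever chain they find and writers only CAS the slot, no page is ever updated in place, which is what makes the structure friendly to both multi-core caches and log-structured flash.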


ReFS v2: Cloning, Projecting, and Moving Data
J.R. Tipton, Development Lead, Microsoft

  • Agenda: ReFS v1 primer, ReFS v2 at a glance, motivations for ReFS v2, cloning, translation, transformation
  • ReFS v1 primer: Windows allocate-on-write file system, Merkle trees verify metadata integrity, online data correction from alternate copies, online chkdsk
  • ReFS v2: Available in Windows Server 2016 TP4. Efficient, reliable storage for VMs, efficient parity, write tiering, read caching, block cloning, optimizations
  • Motivations for ReFS v2: cheap storage does not mean slow, VM density, VM provisioning, more hardware flavors (SLC, MLC, TLC flash, SMR)
  • Write performance. Magic does not work in a few environments (super fast hardware, small random writes, durable writes/FUA/sync/write-through)
  • ReFS Block Cloning: Clone any block of one file into any other block in another file. Full file clone, reorder some or all data, project data from one area into another without copy
  • ReFS Block Cloning: Metadata only operation. Copy-on-write used when needed (ReFS knows when).
  • Cloning examples: deleting a Hyper-V VM checkpoint, VM provisioning from image.
  • Cloning observations: app directed, avoids data copies, metadata operations, Hyper-V is the first but not the only one using this
  • Cloning is no free lunch: multiple valid copies will copy-on-write upon changes. Metadata overhead to track state. A slam dunk in most cases, but not all.
  • ReFS cluster bands. Volume internally divvied up into bands that contain regular FS clusters (4KB, 64KB). Mostly invisible outside file system. Bands and clusters track independently (per-band metadata). Bands can come and go.
  • ReFS can move bands around (read/write/update band pointer). Efficient write caching and parity. Writes to bands in fast tier. Tracks heat per band. Moves bands between tiers. More efficient allocation. You can move from 100% triple mirroring to 95% parity.
  • ReFS cluster bands: small writes accumulate where writing is cheap (mirror, flash, log-structured arena), bands are later shuffled to tier where random writes are expensive (band transfers are fully sequential).
  • ReFS cluster bands: transformation. ReFS can do stuff to the data in a band (can happen in the background). Examples: band compaction (put cold bands together, squeeze out free space), band compression (decompress on read).
  • ReFS v2 summary: data cloning, data movement, data transformation. Smart when smart makes sense, switches to dumb when dumb is better. Takes advantages of hardware combinations. And lots of other stuff…
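The block-cloning semantics above (metadata-only clone, copy-on-write when either side changes) can be modeled with a refcounted block table. This is my own toy model, not the ReFS implementation:

```python
class ToyVolume:
    """Toy model of metadata-only cloning with copy-on-write.
    Files are lists of block ids; each block carries a refcount."""

    def __init__(self):
        self.blocks = {}   # block_id -> [data, refcount]
        self.files = {}    # name -> [block_id, ...]
        self._next = 0

    def _alloc(self, data):
        bid = self._next
        self.blocks[bid] = [data, 1]
        self._next += 1
        return bid

    def write_new(self, name, chunks):
        self.files[name] = [self._alloc(c) for c in chunks]

    def clone(self, src, dst):
        # Metadata-only operation: share block ids, bump refcounts.
        ids = list(self.files[src])
        for bid in ids:
            self.blocks[bid][1] += 1
        self.files[dst] = ids

    def write_block(self, name, idx, data):
        bid = self.files[name][idx]
        if self.blocks[bid][1] > 1:          # shared: copy-on-write
            self.blocks[bid][1] -= 1
            self.files[name][idx] = self._alloc(data)
        else:                                # exclusive: write in place
            self.blocks[bid][0] = data

    def read(self, name):
        return [self.blocks[bid][0] for bid in self.files[name]]
```

Deleting a Hyper-V checkpoint then reduces to dropping one file's references, and provisioning a VM from an image is a clone call: both metadata operations, which is the point of the feature.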


Innovator, Disruptor or Laggard, Where Will Your Storage Applications Live? Next Generation Storage
Bev Crair, Vice President and General Manager, Storage Group, Intel

  • The world is changing: information growth, complexity, cloud, technology.
  • Growth: 44ZB of data in all systems. 15% of the data is stored, since perceived cost is low.
  • Every minute of every day: 2013 : 8h of video uploaded to YouTube, 47,000 apps downloaded, 200 million e-mails
  • Every minute of every day: 2015 : 300h of video uploaded to YouTube, 51,000 apps downloaded, 204 million e-mails
  • Data never sleeps: the internet in real time. tiles showing activities all around the internet.
  • Data use pattern changes: sense and generate, collect and communicate, analyze and optimize. Example: HADRON collider
  • Data use pattern changes: from collection to analyzing data, valuable data now reside outside the organization, analyzing and optimizing unstructured data
  • Cloud impact on storage solutions: business impact, technology impact. Everyone wants an easy button
  • Intelligent storage: Deduplication, real-time compression, intelligent tiering, thin provisioning. All of this is a software problem.
  • Scale-out storage: From single system with internal network to nodes working together with an external network
  • Non-Volatile Memory (NVM) accelerates the enterprise: Examples in Virtualization, Private Cloud, Database, Big Data and HPC
  • Pyramid: CPU, DRAM, Intel DIMM (3D XPoint), Intel SSD (3D XPoint), NAND SSD, HDD,  …
  • Storage Media latency going down dramatically. With NVM, the bottleneck is now mostly in the software stack.
  • Future storage architecture: complex chart with workloads for 2020 and beyond. New protocols, new ways to attach.
  • Intel Storage Technologies. Not only hardware, but a fair amount of software. SPDK, NVMe driver, Acceleration Library, Lustre, others.
  • Why does faster storage matter? Genome testing for cancer takes weeks, and the cancer mutates. Genome is 10TB. If we can speed up the time it takes to test it to one day, it makes a huge difference and you can create a medicine that saves a person’s life. That’s why it matters.


The Long-Term Future of Solid State Storage
Jim Handy, General Director, Objective Analysis

  • How we got here? Why are we in the trouble we’re at right now? How do we get ahead of it? Where is it going tomorrow?
  • Establishing a schism: Memory is in bytes (DRAM, Cache, Flash?), Storage is in blocks (Disk, Tape, DVD, SAN, NAS, Cloud, Flash)
  • Is it really about block? Block, NAND page, DRAM pages, CPU cache lines. It’s all in pages anyway…
  • Is there another differentiator? Volatile vs. Persistent. It’s confusing…
  • What is an SSD? SSDs are nothing new. Going back to DEC Bulk Core.
  • Disk interfaces create delays. SSD vs HDD latency chart. Time scale in milliseconds.
  • Zooming in to tens of microseconds. Different components of the SSD delay. Read time, Transfer time, Link transfer, platform and adapter, software
  • Now looking at delays for MLC NAND ONFi2, ONFi3, PCIe x4 Gen3, future NVM on PCIe x4 Gen3
  • Changing the scale to tens of microseconds on future NVM. Link Transfer, Platform & Adapter and Software now account for most of the latency.
  • How to move ahead? Get rid of the disk interfaces (PCIe, NVMe, new technologies). Work on the software: SNIA.
  • Why now? DRAM Transfer rates. Chart transfer rates for SDRAM, DDR, DDR2, DDR3, DDR4. Designing the bus takes most of the time.
  • DRAM running out of speed? We probably won’t see a DDR5. HMC or HBM is a likely next step. Everything points to fixed memory sizes.
  • NVM to the rescue. DRAM is not the only upgrade path. It became cheaper to use NAND flash than DRAM to upgrade a PC.
  • NVM to be a new memory layer between DRAM & NAND: Intel/Micron 3D XPoint – “Optane”
  • One won’t kill the other. Future systems will have DRAM, NVM, NAND, HDD. None of them will go away…
  • New memories are faster than NAND. Chart with read bandwidth vs write bandwidth. Emerging NVRAM: FeRAM, eMRAM, RRAM, PRAM.
  • Complex chart with emerging research memories. Clock frequency vs. Cell Area (cost).
  • The computer of tomorrow. Memory or storage? In the beginning (core memory), there was no distinction between the two.
  • We’re moving to an era where you can turn off the computer, turn it back on and there’s something in memory. Do you trust it?
  • SCM – Storage Class Memory: high performance with archival properties. There are many other terms for it: Persistent Memory, Non-Volatile Memory.
  • New NVM has disruptively low latency: Log chart with latency budgets for HDD, SATA SSD, NVMe, Persistent. When you go below 10 microseconds (as Persistent does), context switching does not make sense.
  • Non-blocking I/O. NUMA latencies up to 200ns have been tolerated. Latencies below these cause disruption.
  • Memory mapped files eliminate file system latency.
  • The computer of tomorrow. Fixed DRAM size, upgradeable NVM (tomorrow’s DIMM), both flash and disk (flash on PCIe or own bus), much work needed on SCM software
  • Q: Will all these layers survive? I believe so. There are potential improvements in all of them (cited a few on NAND, HDD).
  • Q: Shouldn’t we drop one of the layers? Usually, adding layers (not removing them) is more interesting from a cost perspective.
  • Q: Do we need a new protocol for SCM? NAND did well without much of that. Alternative memories could be put on a memory bus.
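The point about context switching no longer making sense below ~10 microseconds can be sanity-checked with simple arithmetic. A toy sketch (the ~5 µs context-switch cost is my illustrative assumption, not a figure from the talk):

```python
def best_strategy(device_latency_us, ctx_switch_us=5.0):
    """Blocking (sleep, then wake on interrupt) only pays off when the device
    is slower than the round-trip cost of switching away and back; below
    that threshold, spinning (polling) wastes less CPU time overall."""
    return "block" if device_latency_us > ctx_switch_us * 2 else "poll"

assert best_strategy(10_000) == "block"   # HDD-class latency (~10 ms): block
assert best_strategy(100) == "block"      # NVMe SSD (~100 us): still block
assert best_strategy(2) == "poll"         # persistent memory (~2 us): poll
```

With the assumed numbers the crossover lands right at the 10 µs budget the talk cites for persistent media.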


Concepts on Moving From SAS connected JBOD to an Ethernet Connected JBOD
Jim Pinkerton, Partner Architect Lead, Microsoft

  • What if we took a JBOD, a simple device, and just put it on Ethernet?
  • Re-Thinking the Software-defined Storage conceptual model definition: compute nodes, storage nodes, flaky storage devices
  • Front-end fabric (Ethernet, IB or FC), Back-end fabric (directly attached or shared storage)
  • Yesterday’s Storage Architecture: Still highly profitable. Compute nodes, traditional SAN/NAS box (shipped as an appliance)
  • Today: Software Defined Storage (SDS) – “Converged”. Separate the storage service from the JBOD.
  • Today: Software Defined Storage (SDS) – “Hyper-Converged” (H-C). Everything ships in a single box. Scale-out architecture.
  • H-C appliances are a dream for the customer to install/use, but the $/GB storage is high.
  • Microsoft Cloud Platform System (CPS). Shipped as a packaged deal. Microsoft tested and guaranteed.
  • SDS with DAS – Storage layer divided into storage front-end (FE) and storage back-end (BE). The two communicate over Ethernet.
  • SDS Topologies. Going from Converged and Hyper-Converged to a future EBOD topology. From file/block access to device access.
  • Expose the raw device over Ethernet. The raw device is flaky, but we love it. The storage FE will abstract that, add reliability.
  • I would like to have an EBOD box that could provide the storage BE.
  • EBOD works for a variety of access protocols and topologies. Examples: SMB3 “block”, Lustre object store, Ceph object store, NVMe fabric, T10 objects.
  • Shared SAS Interop. Nightmare experience (disk multi-path interop, expander multi-path interop, HBA distributed failure). This is why customers prefer appliances.
  • To share or not to share. We want to share, but we do not want shared SAS. Customer deployment is more straightforward, but you have more traffic on Ethernet.
  • Hyper-Scale cloud tension – fault domain rebuild time. Depends on number of disks behind a node and how much network you have.
  • Fault domain for storage is too big. Required network speed offsets cost benefits of greater density. Many large disks behind a single node becomes a problem.
  • Private cloud tension – not enough disks. Entry points at 4 nodes, small number of disks. Again, fault domain is too large.
  • Goals in refactoring SDS – Storage back-end is a “data mover” (EBOD). Storage front-end is “general purpose CPU”.
  • EBOD goals – Can you hit a cost point that’s interesting? Reduce storage costs, reduce size of fault domain, build a more robust ecosystem of DAS. Keep topology simple, so customer can build it themselves.
  • EBOD: High end box, volume box, capacity box.
  • EBOD volume box should be close to what a JBOD costs. Basically like exposing raw disks.
  • Comparing current Hyper-Scale to EBOD. EBOD has a NIC and an SoC, in addition to the traditional expander in a JBOD.
  • EBOD volume box – Small CPU and memory, dual 10GbE, SOC with RDMA NIC/SATA/SAS/PCIe, up to 20 devices, SFF-8639 connector, management (IPMI, DMTF Redfish?)
  • Volume EBOD Proof Point – Intel Avoton, PCIe Gen 2, Chelsio 10GbE, SAS HBA, SAS SSD. Looking at random read IOPS (local, RDMA remote and non-RDMA remote). Max 159K IOPS w/RDMA, 122K IOPS w/o RDMA. Latency chart showing just a few msec.
  • EBOD Performance Concept – Big CPU, Dual attach 40GbE, Possibly all NVME attach or SCM. Will show some of the results this afternoon.
  • EBOD is an interesting approach that’s different from what we’re doing. But it’s nicely aligned with software-defined storage.
  • Price point of EBOD must be carefully managed, but the low price point enables a smaller fault domain.
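The fault-domain rebuild-time tension mentioned above comes down to simple arithmetic: rebuild time grows with the capacity behind a node and shrinks with available network bandwidth. A back-of-the-envelope sketch (all numbers and the 70% efficiency factor are illustrative assumptions, not figures from the talk):

```python
def rebuild_hours(disks, disk_tb, net_gbps, efficiency=0.7):
    """Hours to re-replicate all data behind a failed node over the network."""
    data_bits = disks * disk_tb * 1e12 * 8        # capacity in bits
    usable_bps = net_gbps * 1e9 * efficiency      # sustained network rate
    return data_bits / usable_bps / 3600

# A dense node: 60 x 8 TB disks behind dual 10GbE (~76 hours to rebuild)
big = rebuild_hours(disks=60, disk_tb=8, net_gbps=20)
# A small EBOD fault domain: 20 x 8 TB disks behind the same network
small = rebuild_hours(disks=20, disk_tb=8, net_gbps=20)
assert abs(big - 3 * small) < 1e-6   # rebuild time scales with disks per node
```

This is the argument for EBOD: fewer disks behind each node keeps the fault domain, and therefore the rebuild window, small without buying more network.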


Planning for the Next Decade of NVM Programming
Andy Rudoff, SNIA NVM Programming TWG, Intel

  • Looking at what’s coming up in the next decade, but will start with some history.
  • Comparison of data storage technologies. Emerging NV technologies with read times in the same order of magnitude as DRAM.
  • Moving the focus to software latency when using future NVM.
  • Is it memory or storage? It’s persistent (like storage) and byte-addressable (like memory).
  • Storage vs persistent memory. Block IO vs. byte addressable, sync/async (DMA master)  vs. sync (DMA slave). High capacity vs. growing capacity.
  • pmem: The new Tier. Byte addressable, but persistent. Not NAND. Can do small I/O. Can DMA to it.
  • SNIA TWG (lots of companies). Defining the NVM programming model: NVM.PM.FILE mode and NVM.PM.VOLUME mode.
  • All the OSes created in the last 30 years have a memory mapped file.
  • Is this stuff real? Why are we spending so much time on this? Yes – Intel 3D XPoint technology, the Intel DIMM. Showed a wafer on stage. 1000x faster than NAND. 1000X endurance of NAND, 10X denser than conventional memory. As much as 6TB of this stuff…
  • Timeline: Big gap between NAND flash memory (1989) and 3D XPoint (2015).
  • Diagram of the model with Management, Block, File and Memory access. Link at the end to the diagram.
  • Detecting pmem: Defined in the ACPI 6.0. Linux support upstream (generic DIMM driver, DAX, ext4+DAX, KVM).  Neal talked about Windows support yesterday.
  • Heavy OSV involvement in TWG, we wrote the spec together.
  • We don’t want every application to have to re-architect itself. That’s why we have block and file there as well.
  • The next decade
  • Transparency levels: increasing barrier to adoption, increasing leverage. Could do it in layers. For instance, could be file system only, without app modification. For instance, could modify just the JVM to get significant advantages without changing the apps.
  • Comparing to multiple cores in hardware and multi-threaded programming. Took a decade or longer, but it’s commonplace now.
  • One transparent example: pmem Paging. Paging from the OS page cache (diagrams).
  • Attributes of paging: major page faults, memory looks much larger, page in must pick a victim, many enterprise apps opt-out, interesting example: Java GC.
  • What would it look like if you paged to pmem instead of paging to storage. I don’t even care that it’s persistent, just that there’s a lot of it.
  • I could kick a page out synchronously, probably faster than a context switch. But the app could access the data in pmem without swapping it in (that‘s new!). Could have policies for which app lives in which memory. The OS could manage that, with application transparency.
  • Would this really work? It will when pmem costs less, performance is close, capacity is significant and it is reliable. “We’re going to need a bigger byte” to hold error information.
  • Not just for pmem. Other memories technologies are emerging. High bandwidth memory, NUMA localities, different NVM technologies.
  • Extending into user space: NVM Library – pmem.io (64-bit Linux Alpha release). Windows is working on it as well.
  • That is a non-transparent example. It’s hard (like multi-threading). Things can fail in interesting new ways.
  • The library makes it easier and some of it is transactional.
  • No kernel interception point, for things like replication. No chance to hook above or below the file system. You could do it in the library.
  • Non-transparent use cases: volatile caching, in-memory database, storage appliance write cache, large byte-addressable data structures (hash table, dedup), HPC (checkpointing)
  • Sweet spots: middleware, libraries, in-kernel usages.
  • Big challenge: middleware, libraries. Is it worth the complexity.
  • Building a software ecosystem for pmem, cost vs. benefit challenge.
  • Prepare yourself: learn the NVM programming model, map use cases to pmem, contribute to the libraries, software ecosystem
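The point that "things can fail in interesting new ways" and that the library makes some of it transactional can be illustrated with a minimal crash-consistency sketch. This uses Python's standard `mmap` as a stand-in for real persistent memory (the file name, layout, and flag convention are my own illustration, not the pmem.io API):

```python
import mmap
import os
import struct

PATH = "pmem_demo.bin"
with open(PATH, "wb") as f:
    f.write(b"\x00" * 4096)          # zeroed region; valid flag starts at 0

with open(PATH, "r+b") as f:
    mm = mmap.mmap(f.fileno(), 4096)
    payload = b"account balance: 42"
    mm[8:8 + len(payload)] = payload
    mm.flush()                        # ordering matters: persist data first...
    mm[0:8] = struct.pack("<Q", 1)    # ...then flip the validity flag
    mm.flush()
    # A crash between the two flushes leaves the flag at 0, so recovery
    # simply ignores the half-written payload. This flush ordering is the
    # kind of discipline a pmem transaction library automates for you.
    assert struct.unpack("<Q", mm[0:8])[0] == 1
    mm.close()
os.remove(PATH)
```

Forgetting one of those flushes, or reordering them, is exactly the new failure mode the talk warns about.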


FS Design Around SMR: Seagate’s Journey and Reference System with EXT4
Adrian Palmer, Drive Development Engineering, Seagate Technologies

  • SNIA Tutorial. I’m talking about the standard, as opposed as the design of our drive.
  • SMR is being embraced by everyone, since this is a major change, a game changer.
  • From random writes to a write profile resembling sequential-access tape.
  • 1 new condition: forward-write preferred. ZAC/ZBD specs: T10/T13. Zones, SCSI ZBC standards, ATA ZAC standards.
  • What is a file system? Essential software on a system, structured and unstructured data, stores metadata and data.
  • Basic FS requirements: Write-in-place (superblock, known location on disk), Sequential write (journal), Unrestricted write type (random or sequential)
  • Drive parameters: Sector (atomic unit of read/write access). Typically 512B size. Independently accessed. Read/write, no state.
  • Drive parameters: Zone (atomic performant rewrite unit). Typically 256 MiB in size. Indirectly addressed via sector. Modified with ZAC/ZBD commands. Each zone has state (WritePointer, Condition, Size, Type).
  • Write Profiles. Conventional (random access), Tape (sequential access), Flash (sequential access, erase blocks), SMR HA/HM (sequential access, zones). SMR write profile is similar to Tape and Flash.
  • Allocation containers. Drive capacities are increasing, location mapping is expensive. 1.56% with 512B blocks or 0.2% with 4KB blocks.
  • Remap the block device as a… block device. Partitions (w*sector size), Block size (x*sector size), Group size (y*Block size), FS (z*group size, expressed as blocks).
  • Zones are a good fit to be matched with Groups. Absorb and mirror the metadata, don’t keep querying drive for metadata.
  • Solving the sequential write problem. Separate the problem spaces with zones.
  • Dedicate zones to each problem space: user data, file records, indexes, superblock, trees, journal, allocation containers.
  • GPT/Superblocks: First and last zone (convention, not guaranteed). Update infrequently, and at dismount. Looks at known location and WritePointer. Copy-on-update. Organized wipe and update algorithm.
  • Journal/soft updates. Update very frequently, 2 or more zones, set up as a circular buffer. Checkpoint at each zone. Wipe and overwrite oldest zone. Can be used as NV cache for metadata. Requires lots of storage space for efficient use and NV.
  • Group descriptors: Infrequently changed. Changes on zone condition change, resize, free block counts. Write cached, but written at WritePointer. Organized as a B+Tree, not an indexed array. The B+Tree needs to be stored on-disk.
  • File Records: POSIX information (ctime, mtime, atime, msize, fs specific attributes), updated very frequently. Allows records to be modified in memory, written to journal cache, gather from journal, write to new blocks at WritePointer.
  • Mapping (file records to blocks). File ideally written as a single chunk (single pointer), but could become fragmented (multiple pointers). Can outgrow file record space, needs its own B+Tree. List can be in memory, in the journal, written out to disk at WritePointer.
  • Data: Copy-on-write. Allocator chooses blocks at WritePointer. Writes are broken at zone boundary, creating new command and new mapping fragment.
  • Cleanup: Cannot clean up as you go, need a separate step. Each zone will have holes. Garbage collection: Journal GC, Zones GC, Zone Compaction, Defragmentation.
  • Advanced features: indexes, queries, extended attributes, snapshots, checksums/parity, RAID/JBOD.
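The zone state and the rule that writes break at zone boundaries (creating a new command and a new mapping fragment) can be sketched in a few lines. A toy model, not Seagate's implementation, with made-up 256-byte "zones" standing in for 256 MiB ones:

```python
class Zone:
    """Minimal model of a ZAC/ZBD zone: state includes a WritePointer."""
    def __init__(self, size):
        self.size = size
        self.write_pointer = 0   # next writable offset within the zone

def sequential_write(zones, zone_idx, data):
    """Append-only write starting at the current zone's write pointer.
    A write that reaches a zone boundary is split into a new command in
    the next zone, yielding one mapping fragment per zone touched."""
    fragments = []               # (zone index, offset, length) triples
    remaining = data
    while remaining:
        z = zones[zone_idx]
        room = z.size - z.write_pointer
        if room == 0:            # zone full: advance to the next zone
            zone_idx += 1
            continue
        chunk, remaining = remaining[:room], remaining[room:]
        fragments.append((zone_idx, z.write_pointer, len(chunk)))
        z.write_pointer += len(chunk)
    return fragments

zones = [Zone(size=256) for _ in range(4)]
frags = sequential_write(zones, 0, b"x" * 300)   # crosses one zone boundary
assert frags == [(0, 0, 256), (1, 0, 44)]        # two mapping fragments
```

This is why file mappings can outgrow the file record space and need their own B+Tree: every boundary crossing adds a fragment.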


Azure File Service: ‘Net Use’ the Cloud
David Goebel, Software Engineer, Microsoft

  • Agenda: features and API (what), scenarios enabled (why), design of an SMB server not backed by a conventional FS (how)
  • It’s not the Windows SMB server (srv2.sys). Uses Azure Tables and Azure Blobs for the actual files.
  • Easier because we already have a highly available and distributed architecture.
  • SMB 2.1 in preview since last summer. SMB 3.0 (encryption, persistent handles) in progress.
  • Azure containers mapped as shares. Clients work unmodified out-of-the-box. We implemented the spec.
  • Share namespace is coherently accessible
  • MS-SMB2, not SMB1. Anticipates (but does not require) a traditional file system on the other side.
  • In some ways it’s harder, since what’s there is not a file system. We have multiple tables (for leases, locks, etc). Nice and clean.
  • SMB is a stateful protocol, while REST is all stateless. Some state is immutable (like FileId), some state is transient (like open counts), some is maintained by the client (like CreateGuid), some state is ephemeral (connection).
  • Diagram with the big picture. Includes DNS, load balancer, session setup & traffic, front-end node, azure tables and blobs.
  • Front-end has ephemeral and immutable state. Back-end has solid and fluid durable state.
  • Diagram with two clients accessing the same file and share, using locks, etc. All the state handled by the back-end.
  • Losing a front-end node considered a regular event (happens during updates), the client simply reconnects, transparently.
  • Current state, SMB 2.1 (SMB 3.0 in the works). 5TB per share and 1TB per file. 1,000 8KB IOPS per share, 60MB/sec per share. Some NTFS features not supported, some limitations on characters and path length (due to HTTP/REST restrictions).
  • Demo: I’m actually running my talk using a PPTX file on Azure File. Robocopy to file share. Delete, watch via explorer (notifications working fine). Watching also via wireshark.
  • Current Linux support. Lists specific versions of Ubuntu Server, Ubuntu Core, CentOS, openSUSE, SUSE Linux Enterprise Server.
  • Why: They want to move to cloud, but they can’t change their apps. Existing file I/O applications. Most of what was written over the last 30 years “just works”. Minor caveats that will become more minor over time.
  • Discussed specific details about how permissions are currently implemented. ACL support is coming.
  • Example: Encryption enabled scenario over the internet.
  • What about REST? SMB and REST access the same data in the same namespace, so a gradual application transition without disruption is possible. REST for container, directory and file operations.
  • The durability game. Modified state that normally exists only in server memory, which must be durably committed.
  • Examples of state tiering: ephemeral state, immutable state, solid durable state, fluid durable state.
  • Example: Durable Handle Reconnect. Intended for network hiccups, but stretched to also handle front-end reconnects. Limited our ability because of SMB 2.1 protocol compliance.
  • Example: Persistent Handles. Unlike durable handles, SMB 3 is actually intended to support transparent failover when a front-end dies. Seamless transparent failover.
  • Resource Links: Getting started blog (http://blogs.msdn.com/b/windowsazurestorage/archive/2014/05/12/introducing-microsoft-azure-file-service.aspx) , NTFS features currently not supported (https://msdn.microsoft.com/en-us/library/azure/dn744326.aspx), naming restrictions for REST compatibility (https://msdn.microsoft.com/library/azure/dn167011.aspx).


Software Defined Storage - What Does it Look Like in 3 Years?
Richard McDougall, Big Data and Storage Chief Scientist, VMware

  • How do you come up with a common, generic storage platform that serves the needs of applications?
  • Bringing a definition of SDS. Major trends in hardware, what the apps are doing, cloud platforms
  • Storage workloads map. Many apps on 4 quadrants along 2 axes: capacity (10’s of Terabytes to 10’s of Petabytes) and IOPS (1K to 1M)
  • What are cloud-native applications? Developer access via API, continuous integration and deployment, built for scale, availability architected in the app, microservices instead of monolithic stacks, decoupled from infrastructure
  • What do Linux containers need from storage? Copy/clone root images, isolated namespace, QoS controls
  • Options to deliver storage to containers: copy whole root tree (primitive), fast clone using shared read-only images, clone via “Another Union File System” (aufs), leverage native copy-on-write file system.
  • Shared data: Containers can share a file system within a host or across hosts (new interest in distributed file systems)
  • Docker storage abstractions for containers: non-persistent boot environment, persistent data (backed by block volumes)
  • Container storage use cases: unshared volumes, shared volumes, persist to external storage (API to cloud storage)
  • Eliminate the silos: converged big data platform. Diagram shows Hadoop, HBase, Impala, Pivotal HawQ, Cassandra, Mongo, many others. HDFS, MAPR, GPFS, POSIX, block storage. Storage system common across all these, with the right access mechanism.
  • Back to the quadrants based on capacity and IOPS. Now with hardware solutions instead of software. Many flash appliances in the upper left (low capacity, high IOPS). Isilon in the lower right (high capacity, low IOPS).
  • Storage media technologies in 2016. Pyramid with latency, capacity per device, capacity per host for each layer: DRAM (1TB/device, 4TB/host, ~100ns latency), NVM (1TB, 4TB, ~500ns), NVMe SSD (4TB, 48TB, ~10us), capacity SSD (16TB, 192TB, ~1ms), magnetic storage (32TB, 384TB, ~10ms), object storage (?, ?, ~1s). 
  • Back to the quadrants based on capacity and IOPS. Now with storage media technologies.
  • Details on the types of NVDIMM (NVDIMM-N – Type 1, NVDIMM-F – Type 2, Type 4). Standards coming up for all of these. Needs work to virtualize those, so they show up properly inside VMs.
  • Intel 3D XPoint Technology.
  • What are the SDS solutions that can sit on top of all this? Back to quadrants with SDS solutions. Mentions Nexenta, ScaleIO, VSAN, Ceph, Scality, MapR, HDFS. Can you make one solution that works well for everything?
  • What’s really behind a storage array? The value from the customer is that it’s all from one vendor and it all works. Nothing magic, but the vendor spent a ton of time on testing.
  • Types of SDS: Fail-over software on commodity servers (lists many vendors), complexity in hardware, interconnects. Issues with hardware compatibility.
  • Types of SDS: Software replication using servers + local disks. Simpler, but not very scalable.
  • Types of SDS: Caching hot core/cold edge. NVMe flash devices up front, something slower behind it (even cloud). Several solutions, mostly startups.
  • Types of SDS: Scale-out SDS. Scalable, fault-tolerant, rolling updates. More management, separate compute and storage silos. Model used by ceph, ScaleiO. Issues with hardware compatibility. You really need to test the hardware.
  • Types of SDS: Hyper-converged SDS. Easy management, scalable, fault-tolerant, rolling upgrades. Fixed compute to storage ratio. Model used by VSAN, Nutanix. Amount of variance in hardware still a problem. Need to invest in HCL verification.
  • Storage interconnects. Lots of discussion on what’s the right direction. Protocols (iSCSI, FC, FCoE, NVMe, NVMe over Fabrics), Hardware transports (FC, Ethernet, IB, SAS), Device connectivity (SATA, SAS, NVMe)
  • Network. iSCSI, iSER, FCoE, RDMA over Ethernet, NVMe Fabrics. Can storage use the network? RDMA debate for years. We’re at a tipping point.
  • Device interconnects: HCA with SATA/SAS. NVMe SSD, NVM over PCIe. Comparing iSCSI, FCoE and NVMe over Ethernet.
  • PCIe rack-level Fabric. Devices become addressable. PCIe rack-scale compute and storage, with host-to-host RDMA.
  • NVMe – The new kid on the block. Support from various vendors. Quickly becoming the all-purpose stack for storage, becoming the universal standard for talking block.
  • Beyond block: SDS Service Platforms. Back to the 4 quadrants, now with service platforms.
  • Too many silos: block, object, database, key-value, big data. Each one is its own silo with its own machines, management stack, HCLs. No sharing of infrastructure.
  • Option 1: Multi-purpose stack. Has everything we talked about, but it’s a compromise.
  • Option 2: Common platform + ecosystem of services. Richest, best-of-breed services, on a single platform, manageable, shared resources.


Why the Storage You Have is Not the Storage Your Data Needs
Laz Vekiarides, CTO and Co-founder, ClearSky Data

  • ClearSky Data is a tech company that consumes what we discussed in this conference.
  • The problem we’re trying to solve is the management of the storage silos
  • Enterprise storage today. Chart: Capacity vs. $/TB. Flash, Mid-Range, Scale-Out. Complex, costly silos
  • Describe the lifecycle of the data, the many copies you make over time, the rebuilding and re-buying of infrastructure
  • What enterprises want: buy just enough of the infrastructure, with enough performance, availability, security.
  • Cloud economics – pay only for the stuff that you use, you don’t have to see all the gear behind the storage, someone does the physical management
  • Tiering is a bad answer – Nothing remains static. How fast does hot data cool? How fast does it re-warm? What is the overhead to manage it? It’s a huge overhead. It’s not just a bandwidth problem.
  • It’s the latency, stupid. Data travels at the speed of light. Fast, but finite. Boston to San Francisco: 29.4 milliseconds of round-trip time (best case). Reality (with switches, routers, protocols, virtualization) is more like 70 ms.
  • So, where exactly is the cloud? Amazon East is near Ashburn, VA. Best case is 10ms RTT. Worst case is ~150ms (does not include time to actually access the storage).
  • ClearSky solution: a global storage network. The infrastructure becomes invisible to you, what you see is a service level agreement.
  • Solution: Geo-distributed data caching. Customer SAN, Edge, Metro POP, Cloud. Cache on the edge (all flash), cache on the metro POP.
  • Edge to Metro POP are private lines (sub millisecond latency). Addressable market is the set of customers within a certain distance to the Metro POP.
  • Latency math: Less than 1ms to the Metro POP, cache miss path is between 25ms and 50ms.
  • Space Management: Edge (hot, 10%, 1 copy), POP (warm, <30%, 1-2 copies), Cloud (100%, n copies). All data is deduplicated and encrypted.
  • Modeling cache performance: Miss ratio curve (MRC). Performance as f(size), working set knees, inform allocation policy.
  • Reuse distance (unique intervening blocks between use and reuse). LRU is most of what’s out there. Look at stacking algorithms. Chart on cache size vs. miss ratio. There’s a talk on this tomorrow by CloudPhysics.
  • Worked with customers to create a heat map data collector. Sizing tool for VM environments. Collected 3-9 days of workload.
  • ~1,400 virtual disks, ~800 VMs, 18.9TB (68% full), avg read IOPS 5.2K, write IOPS 5.9K. Read IO 36KB, write IO 110KB. Read Latency 9.7ms, write latency 4.5ms.
  • This is average latency; maximum is interesting, some are off the chart. Some were hundreds of ms, even 2 seconds.
  • Computing the cache miss ratio. How much cache would we need to get about 90% hit ratio? Could do it with less than 12% of the total.
  • What is cache hit for writes? What fits in the write-back cache. You don’t want to be synchronous with the cloud. You’ll go bankrupt that way.
  • Importance of the warm tier. Hot data (Edge, on prem, SSD) = 12%, warm data (Metro PoP, SSD and HDD) = 6%, cold data (Cloud) = 82%. Shown as a “donut”.
  • Yes, this works! We’re having a very successful outcome with the customers currently engaged.
  • Data access is very tiered. Small amounts of flash can yield disproportionate performance benefits. Single tier cache in front of high latency storage can’t work. Network latency is as important as bounding media latency.
  • Make sure your caching is simple. Sometimes you are overthinking it.
  • Identifying application patterns is hard. Try to identify the sets of LBA that are accessed. Identify hot spots, which change over time. The shape of the miss ratio remains similar.
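The reuse-distance analysis described above (unique intervening blocks between use and reuse, fed into a miss ratio curve) can be sketched directly. A toy stack-distance implementation over a synthetic trace, not ClearSky's or CloudPhysics' code:

```python
from collections import OrderedDict

def miss_ratio_curve(trace, cache_sizes):
    """LRU miss ratio at several cache sizes via reuse (stack) distance:
    the number of unique blocks touched between a use and its reuse.
    An access misses in an LRU cache of size c iff its distance >= c."""
    stack = OrderedDict()            # most-recently-used block is last
    distances = []
    for block in trace:
        if block in stack:
            keys = list(stack)
            distances.append(len(keys) - 1 - keys.index(block))
            del stack[block]
        else:
            distances.append(float("inf"))   # cold (compulsory) miss
        stack[block] = True                  # move/insert at MRU position
    return {c: sum(d >= c for d in distances) / len(distances)
            for c in cache_sizes}

trace = [1, 2, 3, 1, 2, 3, 4, 1]             # cyclic reuse pattern
mrc = miss_ratio_curve(trace, cache_sizes=[1, 3, 4])
assert mrc[1] == 1.0       # a 1-block cache misses everything here
assert mrc[4] == 4 / 8     # at size 4, only the cold misses remain
```

The "working set knee" the talk mentions is visible even in this toy trace: the miss ratio drops sharply once the cache covers the cycle of reused blocks.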


Emerging Trends in Software Development
Donnie Berkholz, Research Director, 451 Research

  • How people are building applications. How storage developers are creating and shipping software.
  • Technology adoption is increasingly bottom-up. Open source, cloud. Used to be like building a cathedral, now it’s more like a bazaar.
  • App-dev workloads are quickly moving to the cloud. Chart from all-on-prem at the top to all-cloud at the bottom.
  • All on-prem going from 59% now to 37% in a few years. Moving to different types of clouds (private cloud, Public cloud (IaaS), Public cloud (SaaS).
  • Showing charts for total data at organization, how much in off-premises cloud (TB and %). 64% of people have less than 20% on the cloud.
  • The new stack. There’s a lot of fragmentation. 10 languages in the top 80%. Used to be only 3 languages. Same thing for databases. It’s more composable, right tool for the right job.
  • No single stack. An infinite set of possibilities.
  • Growth in Web APIs charted since 2005 (from ProgrammableWeb). Huge growth.
  • What do enterprises think of storage vendors? Top vendors listed. People not particularly happy with their storage vendors. Promise index vs. fulfillment index.
  • Development trends that will transform storage.
  • Containers. Docker, docker, docker. Whale logos everywhere. When does it really make sense to use VMs or containers? You need lots of random I/O for these to work well. 10,000 containers in a cluster? Where do the databases go?
  • Developers love Docker. Chart on configuration management GitHub totals (CFEngine, Puppet, Chef, Ansible, Salt, Docker). Shows developer adoption. Docker is off the charts.
  • It’s not just a toy. Survey of 1,000 people on containers. Docker is only 2.5 years old now. 20% no plans, 56% evaluating. Those doing a pilot or more add up to 21%. That’s really fast adoption.
  • Docker to microservices.
  • Amazon: “Every single data transfer between teams has to happen through an API or you’re fired”. Avoid sending spreadsheets around.
  • Microservices thinking is more business-oriented, as opposed to technology-oriented.
  • Loosely couple teams. Team organization has a great influence in your development.
  • The foundation of microservices. Terraform, MANTL, Apache Mesos, Capgemini Apollo, Amazon EC2 Container Service.
  • It’s a lot about scheduling. Number of schedulers that use available resources. Makes storage even more random.
  • Disruption in data processing. Spark. It’s a competitor to Hadoop, really good at caching in memory, also very fast on disk. 10x faster than map-reduce. People don’t have to be big data experts. Chart: Spark came out of nowhere (mining data from several public forums).
  • The market is coming. Hadoop market as a whole growing 46% (CAGR).
  • Storage-class memory. Picture of 3D XPoint. Do app developer care? Not sure. Not many optimize for cache lines in memory. Thinking about Redis in-memory database for caching. Developers probably will use SCM that way. Caching in the order of TB instead of GB.
  • Network will be incredibly important. Moving bottlenecks around.
  • Concurrency for developers. Chart of years vs. percentage of Ohloh. Getting near 1%. That’s a lot since the most popular is around 10%.
  • Development trends
  • DevOps. Taking agile development all the way to production. Agile, truly tip to tail. You want to iterate while involving your customers. Already happening with startups, but how do you scale?
  • DevOps: Culture, Automation (Pets vs. Cattle), Measurement
  • Automation: infrastructure as code. Continuous delivery.
  • Measurement: Nagios, graphite, Graylog2, splunk, Kibana, Sensu, etsy/statsd
  • DevOps is reaching DBAs. #1 stakeholder in recent survey.
  • One of the most popular team structure changes: dispersing the storage team.
  • The changing role of standards
  • The changing role of benchmarks. Torturing databases for fun and profit.
  • I would love for you to join our panel. If you fill our surveys, you get a lot of data for free.


Learnings from Nearly a Decade of Building Low-cost Cloud Storage
Gleb Budman, CEO, Backblaze

  • What we learned, specifically the cost equation
  • 150+ PB of customer data. 10B files.
  • In 2007 we wanted to build something that would backup your PC/Mac data to the cloud. $5/month.
  • Originally we wanted to put it all on S3, but we would lose money on every single customer.
  • Next we wanted to buy SANs to put the data on, but that did not make sense either.
  • We tried a whole bunch of things. NAS, USB-connected drives, etc.
  • Cloud storage has a new player, with a shockingly low price: B2. One fourth of the cost of S3.
  • Lower than Glacier, Nearline, S3-Infrequent Access, anything out there. Savings here add up.
  • Datacenter: convert kilowatts-to-kilobits
  • Datacenter Considerations: local cost of power, real estate, taxes, climate, building/system efficiency, proximity to good people, connectivity.
  • Hardware: Connect hard drives to the internet, with as little as possible in between.
  • Backblaze storage box, costs about $3K. As simple as possible, don’t make the hardware itself redundant. Use commodity parts (example: desktop power supply), use consumer hard drives, insource & use math for drive purchases.
  • They told us we could not use consumer hard drives. But reality is that the failure rate was actually lower. They last 6 years on average. Even if enterprise HDDs never fail, they still don’t make sense.
  • Insource & use math for drive purchases. Drives are the bulk of the cost. Chart with time vs. price per gigabyte. Talking about the Thailand Hard Drive Crisis.
  • Software: Put all intelligence here.
  • Backblaze Vault: 20 hard drives create 1 tome that shares parts of a file, spread across racks.
  • Avoid choke points. Every single storage pod is a first-class citizen. We can parallelize.
  • Algorithmically monitor SMART stats. Know which SMART codes correlate to annual failure rate. All the data is available on the site (all the codes for all the drives). https://www.backblaze.com/SMART
  • Plan for silent corruption. Bad drive looks exactly like a good drive.
  • Put replication above the file system.
  • Run out of resources simultaneously. Hardware and software together. Avoid having the CPU pegged and your memory unused. Have your resources in balance, tweak over time.
  • Model and monitor storage burn. It’s important not to have too much or too little storage. Leading indicator is not storage, it’s bandwidth.
  • Business processes. Design for failure, but fix failures quickly. Drives will die, it’s what happens at scale.
  • Create repeatable repairs. Avoid the need for specialized people to do repair. Simple procedures: either swap a drive or swap a pod. Requires 5 minutes of training.
  • Standardize on the pod chassis. Simplifies so many things…
  • Use ROI to drive automation. Sometimes doing things twice is cheaper than automation. Know when it makes sense.
  • Workflow for storage buffer. Treat buffer in days, not TB. Model how many days of space available you need. Break into three different buffer types: live and running vs. in stock but not live vs. parts.
  • Culture: question “conventional wisdom”. No hardware worshippers. We love our red storage boxes, but we are a software team.
  • Agile extends to hardware. Storage Pod Scrum, with product backlog, sprints, etc.
  • Relentless focus on cost: Is it required? Is there a comparable lower cost option? Can business processes work around it? Can software work around it?
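The "treat buffer in days, not TB" idea combines with the observation that bandwidth, not capacity, is the leading indicator of storage burn. A back-of-the-envelope model (the function, its parameters, and all numbers are my illustrative assumptions, not Backblaze's figures):

```python
def buffer_days(free_tb, ingest_gbps, utilization=0.5):
    """Days of runway left given free capacity and inbound bandwidth.
    Burn rate is what the network actually delivers, scaled by average
    utilization -- bytes on the floor follow bytes on the wire."""
    burn_tb_per_day = ingest_gbps / 8 * utilization * 86400 / 1000
    return free_tb / burn_tb_per_day

# 2 PB free, 10 Gbps of customer ingest averaging 50% utilization
days = buffer_days(free_tb=2000, ingest_gbps=10)
assert 36 < days < 38   # roughly five weeks of runway
```

Modeling runway this way tells you when to move pods from the "in stock but not live" buffer into the "live and running" one.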


f4: Facebook’s Warm BLOB Storage System
Satadru Pan, Software Engineer, Facebook

  • White paper “f4: Facebook’s Warm BLOB Storage System” at http://www-bcf.usc.edu/~wyattllo/papers/f4-osdi14.pdf
  • Looking at how data cools over time. 100x drop in reads in 60 days.
  • Handling failure. Replication: 1.2 * 3 = 3.6. To lose data we need to lose 9 disks or 3 hosts. Hosts in different racks and datacenters.
  • Handling load. Load spread across 3 hosts.
  • Background: Data serving. CDN protects storage, router abstracts storage, web tier adds business logic.
  • Background: Haystack [OSDI2010]. Volume is a series of blobs. In-memory index.
  • Introducing f4: Haystack on cells. Cells = disks spread over a set of racks. Some compute resource in each cell. Tolerant to disk, host, rack or cell failures.
  • Data splitting: Split data into smaller blocks. Reed-Solomon encoding: create stripes with 5 data blocks and 2 parity blocks.
  • Blobs laid out sequentially in a block. Blobs do not cross block boundary. Can also rebuild blob, might not need to read all of the block.
  • Each stripe in a different rack. Each block/blob split into racks. Mirror to another cell. 14 racks involved.
  • Read. Router does Index read, Gets physical location (host, filename, offset). Router does data read. If data read fails, router sends request to compute (decoders).
  • Read under datacenter failure. Replica cell in a different data center. Router proxies read to a mirror cell.
  • Cross datacenter XOR. Third cell has a byte-by-byte XOR of the first two. Now mix this across 3 cells (triplet). Each has 67% data and 33% replica. 1.5 * 1.4 = 2.1X.
  • Looking at reads with datacenter XOR. Router sends two read requests to two local routers. Builds the data from the reads from the two cells.
  • Replication factors: Haystack with 3 copies (3.6X), f4 2.8 (2.8X), f4 2.1 (2.1X). Reduced replication factor, increased fault tolerance, increased load split.
  • Evaluation. What and how much data is “warm”?
  • CDN data: 1 day, 0.5% sampling. BLOB storage data: 2 weeks, 0.1% sampling. Random distribution of blobs assumed; the worst-case rates reported.
  • Hot data vs. warm data: 1 week old – 350 reads/sec/disk, 1 month – 150, 3 months – 70, 1 year – 20. Wants to keep disks above 80 reads/sec/disk, so chose 3 months as the divider between hot and warm.
  • It is warm, not cold. Chart of blob age vs access. Even old data is read.
  • f4 performance: most loaded disk in cluster: 35 reads/second. Well below the 80 reads/sec threshold.
  • f4 performance: latency. Chart of latency vs. read response. f4 is close to Haystack.
  • Conclusions. Facebook blob storage is big and growing. Blobs cool down with age very rapidly. 100x drop in reads in 60 days. Haystack 3.6 replication over provisioning for old, warm data. F4 encodes data to lower replication to 2.1X, without compromising performance significantly.
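The cross-datacenter XOR scheme in the bullets above is easy to demonstrate concretely: a third cell stores the byte-by-byte XOR of two data cells, so losing any one cell leaves enough information to rebuild it. A minimal sketch (byte strings standing in for whole cells):

```python
# Cross-datacenter XOR as described in the f4 talk: cell C holds the
# byte-by-byte XOR of cells A and B, so any single cell can be rebuilt
# from the other two. A toy stand-in, not Facebook's implementation.

def xor(a: bytes, b: bytes) -> bytes:
    assert len(a) == len(b)
    return bytes(x ^ y for x, y in zip(a, b))

cell_a = b"blob-in-datacenter-A"
cell_b = b"blob-in-datacenter-B"
cell_c = xor(cell_a, cell_b)   # parity cell in a third datacenter

# Datacenter A fails: rebuild its data from B and the XOR cell.
assert xor(cell_b, cell_c) == cell_a
# Likewise B can be rebuilt from A and the XOR cell.
assert xor(cell_a, cell_c) == cell_b
```

Combined with the 1.5X local erasure coding, this is where the 1.5 * 1.4 = 2.1X overall replication factor comes from.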


Pelican: A Building Block for Exascale Cold Data Storage
Austin Donnelly, Principal Research Software Development Engineer, Microsoft

  • White paper “Pelican: A building block for exascale cold data storage” at http://research.microsoft.com/pubs/230697/osdi2014-Pelican.pdf
  • This is research, not a product. No product announcement here. This is a science project that we offer to the product teams.
  • Background: Cold data in the cloud. Latency (ms to hours) vs. frequency of access. SSD, 15K rpm HDD, 7.2K rpm HDD, Tape.
  • Defining hot, warm, archival tiers. There is a gap between warm and archival. That’s where Pelican (cold) lives.
  • Pelican: Rack-scale co-design. Hardware and software (power, cooling, mechanical, HDD, software). Trade latency for lower cost. Massive density, low per-drive overhead.
  • Pelican rack: 52U, 1152 3.5” HDD. 2 servers, PCIe bus stretched rack wide. 4 x 10Gb links. Only 8% of disks can spin.
  • Looking at pictures of the rack. Very little there. Not many cables.
  • Interconnect details. Port multiplier, SATA controller, Backplane switch (PCIe), server switches, server, datacenter network. Showing bandwidth between each.
  • Research challenges: Not enough cooling, power, bandwidth.
  • Resource use: Traditional systems can have all disks running at once. In Pelican, a disk is part of a domain: power (2 of 16), cooling (1 of 12), vibration (1 of 2), bandwidth (tree).
  • Data placement: blob erasure-encoded on a set of concurrently active disks. Sets can conflict in resource requirement.
  • Data placement: random is pretty bad for Pelican. Intuition: concentrate conflicts over a few sets of disks. 48 groups of 24 disks. 4 classes of 12 fully-conflicting groups. Blob stored over 18 disks (15+3 erasure coding).
  • IO scheduling: “spin up is the new seek”. All our IO is sequential, so we only need to optimize for spin up. Four schedulers, with 12 groups per scheduler, only one active at a time.
  • Naïve scheduler: FIFO. Pelican scheduler: request batching – trade between throughput and fairness.
  • Q: Would this much spinning up and down reduce the endurance of the disks? A: We’re studying it; not conclusive yet, but looking promising so far.
  • Q: What kind of drives? A: Archive drives, not enterprise drives.
  • Demo. Showing system with 36 HBAs in device manager. Showing Pelican visualization tool. Shows trays, drives, requests. Color-coded for status.
  • Demo. Writing one file: drives spin up, request completes, drives spin down. Reading one file: drives spin up, read completes, drives spin down.
  • Performance. Compare Pelican to a mythical beast. Results based on simulation.
  • Simulator cross-validation. Burst workload.
  • Rack throughput. Fully provisioned vs. Pelican vs. Random placement. Pelican works like fully provisioned up to 4 requests/second.
  • Time to first byte. Pelican adds spin-up time (14.2 seconds).
  • Power consumption. Comparing all disks on standby (1.8kW) vs. all disks active (10.8kW) vs. Pelican (3.7kW).
  • Trace replay: European Centre for Medium-Range Weather Forecasts. Every request for 2.4 years, run through the simulator. Tiering model: tiered system with primary storage, cache and Pelican.
  • Trace replay: Plotting highest response time for a 2h period. Response time was not bad, simulator close to the rack.
  • Trace replay: Plotting deepest queues for a 2h period. Again, simulator close to the rack.
  • War stories. Booting a system with 1152 disks (BIOS changes needed). Port multiplier – port 0 (firmware change needed). Data model for system (serial numbers for everything). Things to track: slots, volumes, media.
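The scheduling bullets above can be sketched in a few lines: one scheduler per class of 12 fully-conflicting groups, only one group per class spun up at a time, and requests batched per group because "spin up is the new seek". The batching policy and batch limit below are my assumptions for illustration, not Pelican's actual algorithm.

```python
from collections import defaultdict, deque

# Sketch of Pelican-style scheduling: 48 groups of 24 disks fall into
# 4 classes of 12 fully-conflicting groups; each class has one scheduler
# and spins up only one group at a time, serving requests in batches.

GROUPS, CLASSES = 48, 4
GROUPS_PER_CLASS = GROUPS // CLASSES  # 12

def class_of(group):
    return group // GROUPS_PER_CLASS

class ClassScheduler:
    def __init__(self, batch_limit=4):
        self.queues = defaultdict(deque)  # group -> pending requests
        self.batch_limit = batch_limit    # trades throughput vs. fairness

    def submit(self, group, request):
        self.queues[group].append(request)

    def run_once(self):
        """Spin up the group with the deepest queue and serve one batch."""
        pending = [g for g, q in self.queues.items() if q]
        if not pending:
            return []
        group = max(pending, key=lambda g: len(self.queues[g]))
        n = min(self.batch_limit, len(self.queues[group]))
        return [(group, self.queues[group].popleft()) for _ in range(n)]

scheds = [ClassScheduler() for _ in range(CLASSES)]
for i in range(6):
    scheds[class_of(3)].submit(3, f"blob-{i}")   # group 3 -> class 0
scheds[class_of(40)].submit(40, "blob-x")        # group 40 -> class 3

assert [g for g, _ in scheds[0].run_once()] == [3, 3, 3, 3]
assert scheds[3].run_once() == [(40, "blob-x")]
```

The `batch_limit` knob is the throughput/fairness trade mentioned in the talk: larger batches amortize spin-up cost, smaller batches keep other groups from starving.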


Torturing Databases for Fun and Profit
Mai Zheng, Assistant Professor Computer Science Department - College of Arts and Sciences, New Mexico State University

  • White paper “Torturing Databases for Fun and Profit” at https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-zheng_mai.pdf
  • Databases are used to store important data. Should provide ACID properties: atomicity, consistency, isolation, durability – even under failures.
  • List of databases that passed the tests: <none>. Everything is broken under simulated power faults.
  • Power outages are not that uncommon. Several high profile examples shown.
  • Fault model: clean termination of the I/O stream. The model does not introduce corruption, dropping or reordering.
  • How to test: Connect database to iSCSI target, then decouple the database from the iSCSI target.
  • Workload example. Key/value table. 2 threads, 2 transactions per thread.
  • Known initial state; each transaction updates N random rows and 1 meta row. Fully exercise concurrency control.
  • Simulate a power fault during the workload. Is there any ACID violation after recovery? Found an atomicity violation.
  • Capture I/O trace without kernel modification. Construct a post-fault disk image. Check the post-fault DB.
  • This makes testing different fault points easy. But enhanced it with more context, to figure out what makes some fault points special.
  • With that, five patterns were found (e.g., unintended updates to mmap’ed blocks). Pattern-based ranking of which fault points will lead to a pattern.
  • Evaluated 8 databases (open source and commercial). Not a single database could survive.
  • The most common violation was durability. Some violations are difficult to trigger, but the framework helped.
  • Case study: A TokyoCabinet Bug. Looking at the fault and why the database recovery did not work.
  • Pattern-based fault injection greatly reduced test points while achieving similar coverage.
  • Wake up call: Traditional testing methodology may not be enough for today’s complex storage systems.
  • Thorough testing requires purpose-built workloads and intelligent fault injection techniques.
  • Different layers in the OS can help in different ways. For instance, iSCSI is an ideal place for fault injection.
  • We should bridge the gaps in understanding and assumptions. For instance, durability might not be provided by the default DB configuration.
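The post-fault image construction described above can be sketched simply: record the block writes the database issues, then simulate a clean power cut at point k by replaying only the first k writes onto a copy of the initial disk image and handing that image to recovery. The block size and trace layout here are hypothetical, chosen for illustration.

```python
# Sketch of clean-termination fault injection: writes after the fault
# point simply never reach the disk (no corruption or reordering),
# matching the paper's fault model. Layout is a toy example.

BLOCK = 4

def post_fault_image(initial: bytes, trace, fault_point):
    """trace: list of (block_no, 4-byte payload) writes, in issue order."""
    img = bytearray(initial)
    for block_no, data in trace[:fault_point]:  # later writes are lost
        img[block_no * BLOCK:(block_no + 1) * BLOCK] = data
    return bytes(img)

initial = b"\x00" * 12                       # 3 empty blocks
trace = [(0, b"AAAA"), (2, b"CCCC"), (1, b"BBBB")]

# Fault after 2 of 3 writes: block 1 never made it to disk.
img = post_fault_image(initial, trace, fault_point=2)
assert img == b"AAAA" + b"\x00" * 4 + b"CCCC"
```

Checking the database's recovered state against ACID expectations for every possible `fault_point` is what makes the iSCSI-based approach cheap to run exhaustively.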


Personal Cloud Self-Protecting Self-Encrypting Storage Devices
Robert Thibadeau, Ph.D., Scientist and Entrepreneur, CMU, Bright Plaza

  • This talk is about personal devices, not enterprise storage.
  • The age of uncontrolled data leaks. Long list of major hacks recently. All initiated by phishing.
  • Security ~= Access Control.  Security should SERVE UP privacy.
  • Computer security ~= IPAAAA: Integrity, Privacy, Authentication, Authorization, Audit, Availability. The first 3 are about encryption; the others aren’t.
  • A storage device is a computing device. Primary host interface, firmware, special hardware functions, diagnostic parts, probe points.
  • For years, there was a scripting language inside the drives.
  • TCG Core Spec. Core (Data Structures, Basic Operations) + Scripting (Amazing use cases).
  • Security Provider: Admin, Locking, Clock, Forensic Logging, Crypto services, internal controls, others.
  • What is an SED (Self-Encrypting Device)? Drive Trust Alliance definition: Device uses built-in hardware encryption circuits to read/write data in/out of NV storage.
  • At least one Media Encryption Key (MEK) is protected by at least one Key Encryption Key (KEK, usually a “password”).
  • Self-Encrypting Storage. Personal Storage Landscape. People don’t realize how successful it is.
  • All self-encrypting today: 100% of all SSDs, 100% of all enterprise storage (HDD, SSD, etc.), all iOS devices, 100% of WD USB HDDs.
  • Much smaller number of personal HDDs are Opal or SED. But Microsoft Bitlocker supports “eDrive” = Opal 2.0 drives of all kinds.
  • You lose 40% of a phone’s performance if you do encryption in software. You must do it in hardware.
  • Working on NVM right now.
  • Drive Trust Alliance: sole purpose to facilitate adoption of Personal SED. www.drivetrust.org
  • SP-SED Rule 1 – When we talk about cloud things, every personal device is actually in the cloud so… Look in the clouds for what should be in personal storage devices.
  • TCG SED Range. Essentially partitions in the storage devices that have their own key. Bitlocker eDrive – 4 ranges. US Government uses DTA open source for creating resilient PCs using ranges. BYOD and Ransomware protection containers.
  • Personal Data Storage (PDS). All data you want to protect can be permitted to be queried under your control.
  • Example: You can ask if you are over 21, but not what your birthday is or how old you are, although data is in your PDS.
  • MIT Media Lab, OpenPDS open source offered by Kerberos Consortium at MIT.
  • Homomorphic Encryption. How can you do computing operations on encrypted data without ever decrypting the data. PDS: Ask questions without any possibility of getting at the data.
  • It’s so simple, but really hard to get your mind wrapped around it. The requests come encrypted, results are encrypted and you can never see the plaintext over the line.
  • A general solution was discovered, but it was computationally infeasible (like Bitcoin). Only in the last few years (since 2011) has it improved.
  • HE Cloud Model and SP-SED Model. Uses OAuth. You can create personal data and you can get access to questions about your personal data. No plain text.
  • Solution for Homomorphic Encryption. Examples – several copies of the data. Multiple encryption schemes. Each operation (Search, Addition, Multiplication) uses a different scheme.
  • There’s a lot of technical work on this now. Your database will grow a lot to accommodate these kinds of operations.
  • SP-SED Rule 2 – Like the internet cloud: if anybody can make money off an SP-SED, then people get really smart really fast… SP-SED should charge $$ for access to the private data they protect.
  • The TCG Core Spec was written with this in mind. PDS and Homomorphic Encryption provide a conceptual path.
  • Challenges to you: The TCG Core was designed to provide service identical to the Apple App Store, but in Self-Protecting Storage devices. Every personal storage device should let the owner of the device make money off his private data on it.
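The homomorphic-encryption bullets above are easiest to see with a concrete partially homomorphic scheme. The talk does not name one, so as an illustration here is a toy Paillier sketch (tiny, insecure parameters, my choice): multiplying two ciphertexts mod n² decrypts to the sum of the plaintexts, so a server can compute a total without ever seeing the data.

```python
from math import gcd

# Toy Paillier cryptosystem (additively homomorphic). Tiny primes for
# illustration only -- completely insecure, but the algebra is real:
# E(m1) * E(m2) mod n^2 decrypts to m1 + m2.

def keygen(p=1117, q=1103):
    n = p * q
    lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    g = n + 1
    # mu = (L(g^lam mod n^2))^-1 mod n, where L(x) = (x - 1) // n
    x = pow(g, lam, n * n)
    mu = pow((x - 1) // n, -1, n)
    return (n, g), (lam, mu)

def encrypt(pub, m, r=42):
    n, g = pub
    assert gcd(r, n) == 1       # r is the randomizer, coprime to n
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)

def decrypt(pub, priv, c):
    n, _ = pub
    lam, mu = priv
    x = pow(c, lam, n * n)
    return (((x - 1) // n) * mu) % n

pub, priv = keygen()
c1, c2 = encrypt(pub, 21, r=17), encrypt(pub, 21, r=23)
c_sum = (c1 * c2) % (pub[0] ** 2)   # homomorphic addition on ciphertexts
assert decrypt(pub, priv, c_sum) == 42
```

This matches the "multiple schemes, one per operation" point: Paillier handles addition; search and multiplication need different schemes (or fully homomorphic encryption, which is what only became less impractical after 2011).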


Hitachi Data Systems - Security Directions and Trends
Eric Hibbard, Chair SNIA Security Technical Working Group, CTO Security and Privacy HDS

  • Protecting critical infrastructure. No agreement on what is critical.
  • What are the sectors of critical infrastructure (CI)? Some commonality, but no agreement. US=16 sectors, CA=10, EU=12, UK=9, JP=10.
  • US Critical Infrastructure. Less than 20% controlled by the government. Significant vulnerabilities. Good news is that cybersecurity is a focus now. Bad news: a lot of interdependencies (lots of things depend on electric power).
  • Threat landscape for CI. Extreme weather, pandemics, terrorism, accidents/technical failures, cyber threats.
  • CI Protection – Catapulted to the forefront. Several incidents, widespread concern, edge of cyber-warfare, state-sponsored actions.
  • President Obama declared a National Emergency on 04/01/2015 due to rising number of cyberattacks.
  • CI protection initiatives. CI Decision-making organizations, CIP decisions. CIP decision-support system. The goal is to learn from attacks, go back and analyze what we could have done better.
  • Where is the US public sector going? Rethinking strategy, know what to protect, understand value of information, beyond perimeter security, cooperation.
  • Disruptive technologies:  Mobile computing, cloud computing, machine-to-machine, big data analytics, industrial internet, Internet of things, Industry 4.0, software defined “anything”. There are security and privacy issues for each. Complexity compounded if used together.
  • M2M maturity. Machine-to-machine communication between devices that are extremely intelligent, maybe AI.
  • M2M analytics building block. Big Data + M2M. This is the heart and soul of smart cities. This must be secured.
  • IoT. 50 billion connected objects expected by 2020. These will stay around for a long time. What if they are vulnerable and inside a wall?
  • IoT will drive big data adoption. Real time and accurate data sensing. They will know where you are at any point in time.
  • CI and emerging technology. IoT helps reduce cost, but it increases risks.
  • Social Infrastructure (Hitachi view). Looking at all kinds of technologies and their interplay. It requires a collaborative system.
  • Securing smart sustainable cities. Complex systems, lots of IoT and cloud and big data, highly vulnerable. How to secure them?


Enterprise Key Management & KMIP: The Real Story  - Q&A with EKM Vendors
Moderator: Tony Cox, Chair SNIA Storage Security Industry Forum, Chair OASIS KMIP Technical Committee
Panelists: Tim Hudson, CTO, Cryptsoft
Nathan Turajski, Senior Product Manager, HP
Bob Lockhart, Chief Solutions Architect, Thales e-Security, Inc
Liz Townsend, Director of Business Development, Townsend Security
Imam Sheikh, Director of Product Management, Vormetric Inc

  • Goal: Q&A to explore perspective in EKM, KMIP.
  • What are the most critical concerns and barriers to adoption?
  • Some of the developers that built the solution are no longer there. Key repository is an Excel spreadsheet. Need to explain that there are better key management solutions.
  • Different teams see this differently (security, storage). Need a set of requirements across teams.
  • Concern with using multiple vendors, interoperability.
  • Getting the right folks educated about basic key management, standards, how to evaluate solutions.
  • Understanding the existing solutions already implemented.
  • Would you say that the OASIS key management standard has progressed to a point where it can be implemented with multiple vendors?
  • Yes, we have demonstrated this many times.
  • Trend to use KMIP to pull keys down from repository.
  • Different vendors excel in different areas, and complex systems do use multiple vendors.
  • We have seen migrations from one vendor to another. The interoperability is real.
  • KMIP has become a cost of entry. Vendors that do not implement it are being displaced.
  • It’s not just storage. Mobile and Cloud as well.
  • What’s driving customer purchasing? Is it proactive or reactive? With interoperability, where is the differentiation?
  • It’s a mix of proactive and reactive. Each vendor has different background and different strengths (performance, clustering models). There are also existing vendor relationships.
  • Organizations still buy for specific applications.
  • It’s mixed, but some customers are planning two years down the line. One vendor might not be able to solve all the problems.
  • Compliance is driving a lot of the proactive work, although meeting compliance is a low bar.
  • Storage drives a lot of it, storage encryption drives a lot of it.
  • What benefits are customers looking for when moving to KMIP? Bad guy getting to the key, good guy losing the key, reliably forget the key to erase data?
  • There’s quite a mix of priorities: operational requirements not to disrupt operations, and assurances that a key has been destroyed and is not kept anywhere.
  • Those were all possible before. KMIP is about making those things easier to use and integrate.
  • Motivation is to follow the standard, auditing key transitions across different vendors.
  • When I look at the EU regulation, cloud computing federating key management. Is KMIP going to scale to billions of keys in the future?
  • We have vendors that work today with tens of billions of keys and are moving beyond that. The underlying technology to handle federation is there; the products will mature over time.
  • It might actually be trillions of keys, when you count all the applications like the smart cities, infrastructure.
  • When LDAP is fully secured and everything is encrypted, how do the secure and unsecure worlds merge?
  • Having conversations about different levels of protections for different attributes and objects.
  • What is the difference between local key management and remote or centralized approaches?
  • There are lots of best practices in the high scale solutions (like separation of duties), and not all of them are there for the local solution.
  • I don’t like to use simple and enterprise to classify. It’s better to call them weak and strong.
  • There are scenarios where the key needs to be local for some reason, but you still need to secure the key, maybe with a hybrid solution that has a cloud component.
  • Some enterprises think in terms of individual projects, local key management. If they step back, they will see the many applications and move to centralized.
  • As the number of keys grows, will we need a lot more repositories with more interop?
  • Yes. It is more and more a requirement, like in cloud and mobile.
  • Use KMIP layer to communicate between them.
  • We’re familiar with use cases, but what about abuse cases? How do we protect that infrastructure?
  • It goes back to not doing security by obscurity.
  • You use a standard and audit the accesses. The system will be able to audit, analyze and alert you when it sees these abuses.
  • The repository has to be secure, with two-factor authentication, real time monitoring, allow lists for who can access the system. Multiple people to control your key sets.
  • Key management is part of the security strategy, which needs to be multi-layered.
  • Simple systems and a common language are a vector for attack, but we need to do it.
  • Key management and encryption is not the end all and be all. There must be multiple layers. Firewall, access control, audit, logging, etc. It needs to be comprehensive.


Lessons Learned from the 2015 Verizon Data Breach Investigations Report
Suzanne Widup, Senior Analyst, Verizon

  • Fact based research, gleaned from case reports. Second year that we used data visualization. Report at http://www.verizonenterprise.com/DBIR/2015/
  • 2015 DBIR: 70 contributed organizations, 79,790 security incidents, 2,122 confirmed data breaches, 61 countries
  • The VERIS framework (actor – who did it, action - how they did it, asset – what was affected, attribute – how it was affected). Given away for free.
  • We can’t share all the data. But some of it is publicly disclosed, and it’s in a GitHub repository as JSON files. http://www.vcdb.org.
  • You can be a part of it. Vcdb.org needs volunteers – be a security hero.
  • Looking at incidents vs. breaches. Divided by industry. Some industries have higher vulnerabilities, but a part of it is due to visibility.
  • Which industries exhibit similar threat profiles? There might be other industries that look similar to yours…
  • Zooming into healthcare and other industries with similar threat profiles.
  • Threat actors. Mostly external. Less than 20% internal.
  • Threat actions. Credentials (down), RAM scrapers (up), spyware/keyloggers (down), phishing (up).
  • The detection deficit. Overall trend is still pretty depressing. The bad guys are innovating faster than we are.
  • Discovery time line (from 2015). Mostly discovered in days or less.
  • The impact of breaches. We were not equipped to measure impact before; this year we partnered with insurance companies. We only have 50% of what is going on here.
  • Plotting the impact of breaches. If you look at the number of incidents, it was going down. If you look at the records lost, it is growing.
  • Charting number of records (1 to 100M) vs. expected loss (US$). There is a band from optimist to pessimist.
  • The nefarious nine: misc errors, crimeware, privilege misuse, lost/stolen assets, web applications, denial of service, cyber-espionage, point of sale, payment card skimmers.
  • Looks different if you use just breaches instead of all incidents. Point of sale is higher, for instance.
  • All incidents, charted over time (graphics are fun!)
  • More charts. Actors and the nine patterns. Breaches by industry.
  • Detailed look at point of sale (highest in accommodation, entertainment and retail), crimeware, cyber-espionage (lots of phishing), insider and privilege misuse (financial motivation), lost/stolen devices, denial of service.
  • Threat intelligence. Share early so it’s actionable.
  • Phishing for hire companies (23% of recipients open phishing messages, 11% click on attachments)
  • 10 CVEs account for 97% of exploits. Pay attention to the old vulnerabilities.
  • Mobile malware. Android “wins” over iOS.
  • Two-factor authentication and patching web servers mitigates 24% of vulnerabilities each.
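Since the VCDB incidents mentioned above are published as VERIS JSON files, tallying them is straightforward. A minimal sketch of counting incidents by top-level action category; the inline records are simplified stand-ins I made up, not the full VCDB schema:

```python
from collections import Counter

# VERIS describes each incident along four A's: actor, action, asset,
# attribute. This tallies VCDB-style records by top-level action
# category (hacking, malware, ...). Records here are toy examples.

def actions_by_category(incidents):
    counts = Counter()
    for incident in incidents:
        counts.update(incident.get("action", {}).keys())
    return counts

incidents = [
    {"action": {"hacking": {"variety": ["Use of stolen creds"]}}},
    {"action": {"malware": {"variety": ["RAM scraper"]}}},
    {"action": {"hacking": {"variety": ["SQLi"]}}},
]
counts = actions_by_category(incidents)
assert counts["hacking"] == 2 and counts["malware"] == 1
```

The same pattern works against the real vcdb.org JSON dump once loaded with `json.load`, which is what makes the shared dataset useful for reproducing the report's breakdowns.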
Comments (7)
  1. JaredCEG says:

    Any written documentation/overviews on ReFS v2, for those unable to attend?

  2. Yoshihiro Kawabata says:

    Awesome, "Storage Class Memory Support in the Windows Operating System".

  3. ryan says:

    SCM looks very exciting! Although "Features not in DAS Mode: …Bitlocker volume encryption, snapshot via VolSnap, mirrored or parity via storage spaces or dynamic disks" hopefully will be just a temporary limitation. Would love to see RDMA SOFS nodes
    serving up SCM-level performance!

  4. ryan says:

    Also anxious to see ReFS v2 in 2016 TP4! "…Small writes accumulate where writing is cheap (mirror, flash, log-structured arena), bands are later shuffled to tier where random writes are expensive (band transfers are fully sequential)." If I understand
    correctly, it sounds like it may solve the Storage Spaces tradeoff of mirroring (fast, low capacity) vs. parity (slow writes, high capacity)–writes could go to a mirrored band and then later are moved to a parity band?

  5. ryan says:

    The EBOD concept is also slick; it sounds similar in some ways to what Coraid did, exposing the raw disks over Ethernet. One note about their approach was that they used AoE (ATA over Ethernet) rather than TCP/IP, which isn’t well-suited to storage traffic,
    particularly from a latency perspective.

    EBOD on the one hand seems like simply redrawing the lines again between the various components of compute nodes and storage nodes. But EBOD could offer significant relief from the SAS requirement for JBODs with SOFS/Storage Spaces (finding SAS SSDs can feel
    like hunting for truffles… without the dog; and they’re just as expensive!). And while Storage Spaces Direct also looks like a very appealing way to work around the complexity of JBODs/SAS/expanders/etc. and boost performance with NVMe options, with different
    EBOD performance tiers, the SOFS node + EBOD model should allow more flexibility to put your storage dollar where you need it most–just buy inexpensive pizza box servers with HBAs hooked up to beefy, purpose-built EBODs, rather than buy very beefy, general-purpose
    servers with high profit margins.

  6. google says:

    Hi there, yes this piece of writing is genuinely nice and I have learned
    lot of things from it concerning blogging.

  7. Aw, this was a really good post. Taking the time and actual effort
    to create a really good article… but what
    can I say… I put things off a whole lot and never manage to get nearly anything done.

Comments are closed.
