SQL Server is trusted by many customers for enterprise-grade, mission-critical workloads that store and process large volumes of data. Technologies like in-memory OLTP and columnstore have also helped our customers improve application performance many times over. But when it comes to hierarchical data with complex relationships, or data that share multiple relationships, users might find themselves struggling to design a good schema that represents all the entities and relationships, and to write optimal queries that analyze complex data and the relationships between the tables.
SQL Server uses foreign keys and joins to handle relationships between entities or tables. Foreign keys only represent one-to-many relationships; hence, to model many-to-many relationships, a common approach is to introduce a table that holds such relationships. For example, Student and Course in a school share a many-to-many relationship: a Student takes multiple Courses and a Course is taken by multiple Students. To represent this kind of relationship, one can create an “Attends” table to hold information about all the Courses a Student is taking. The “Attends” table can then store extra information, such as the dates when a given Student took a Course.
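As a rough illustration of the junction-table approach described above, the Student, Course, and “Attends” example might look like the following. The column names, data types, and key choices are assumptions made for this sketch.

```sql
-- Hypothetical relational schema: a junction table models the
-- many-to-many relationship between Student and Course.
CREATE TABLE Student (
    StudentID INT PRIMARY KEY,
    Name      NVARCHAR(100) NOT NULL
);

CREATE TABLE Course (
    CourseID INT PRIMARY KEY,
    Title    NVARCHAR(200) NOT NULL
);

CREATE TABLE Attends (
    StudentID INT  NOT NULL REFERENCES Student (StudentID),
    CourseID  INT  NOT NULL REFERENCES Course (CourseID),
    StartDate DATE NOT NULL,   -- extra information about the relationship
    EndDate   DATE NULL,
    PRIMARY KEY (StudentID, CourseID, StartDate)
);

-- Courses attended by a given student: two joins for one relationship type.
SELECT c.Title, a.StartDate
FROM Student AS s
JOIN Attends AS a ON a.StudentID = s.StudentID
JOIN Course  AS c ON c.CourseID  = a.CourseID
WHERE s.Name = N'John';
```

Each new relationship type, such as volunteering or mentoring, would typically need another junction table and another pair of joins in every query that touches it.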
Over time, applications tend to evolve and get more complex. For example, a Student can start “Volunteering” in a Course or start “Mentoring” others. This adds new types of relationships to the database. With this type of approach, it is not always easy to modify existing tables to accommodate evolving relationships. Analyzing data connected by means of foreign keys or multiple junction tables involves writing queries with joins across multiple tables, which is no trivial task. The queries can quickly get complex, resulting in complex execution plans and degraded query performance over time.
We live in an era of big data and connected information; people, machines, devices, businesses across the continents are connected to each other more than ever before. Analyzing connected information is becoming critical for businesses to achieve operational agility. Users are finding it easier to model data and complex relationships with the help of graph databases. Native graph databases have risen in popularity, being used for social networks, transportation networks, logistics, and much more. Graph database scenarios can easily be found across several business disciplines, including supply chain management, computer or telecommunication networks, detecting fraud attacks, and recommendation engines.
At Microsoft, we believe that there should be no need for our customers to turn to a new system just to meet their new or evolving graph database requirements. SQL Server is already trusted by millions of customers for mission-critical workloads, and with graph extensions in SQL Server 2017, customers get the best of both relational and graph databases in a single product, including the ability to query across all data using a single platform. Users can also benefit from other cutting-edge technologies already available in SQL Server, such as columnstore indexes, advanced analytics using SQL Server R Services, high availability, and more.
Graph extensions available in SQL Server 2017
A graph schema or database in SQL Server is a collection of node and edge tables. A node represents an entity—for example, a person or an organization—and an edge represents a relationship between the two nodes it connects. Figure 1 shows the architecture of a graph database in SQL Server.
Figure 1: SQL graph database architecture
Create graph objects
With the help of T-SQL extensions to DDL, users can create node or edge tables. Both nodes and edges can have properties associated with them. Users can model many-to-many relationships using edge tables. A single edge type can connect multiple types of nodes with each other, in contrast to foreign keys in relational tables. Figure 2 shows how a node and edge table are stored internally in the database. Since nodes and edges are stored as tables, most of the operations supported on tables are available on node or edge tables, too.
Figure 2: Person Node and Friends Edge table.
The CREATE TABLE syntax guide shows the supported syntax for creation of node and edge tables.
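A minimal sketch of what creating and populating the Person node table and Friends edge table might look like is shown below. The column names (ID, Name, StartDate) and the sample rows are assumptions for illustration; the syntax guide remains the reference for the supported options.

```sql
-- Node table: each row is an entity. SQL Server adds an implicit $node_id column.
CREATE TABLE Person (
    ID   INTEGER PRIMARY KEY,
    Name VARCHAR(100)
) AS NODE;

-- Edge table: each row is a relationship. SQL Server adds implicit
-- $edge_id, $from_id, and $to_id columns. StartDate is an edge property.
CREATE TABLE Friends (
    StartDate DATE
) AS EDGE;

-- Sample nodes (hypothetical data).
INSERT INTO Person (ID, Name)
VALUES (1, 'John'), (2, 'Mary'), (3, 'Alice');

-- An edge from John to Mary: $from_id and $to_id take the $node_id
-- values of the nodes being connected.
INSERT INTO Friends ($from_id, $to_id, StartDate)
VALUES (
    (SELECT $node_id FROM Person WHERE ID = 1),
    (SELECT $node_id FROM Person WHERE ID = 2),
    '2017-01-15'
);
```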
Query language extensions
To help search for a pattern or traverse the graph, a new MATCH clause is introduced that uses ASCII-art syntax for pattern matching and navigation. For example, consider the Person node table and Friends edge table shown in Figure 2; the following query will return friends of “John”:
SELECT Person2.Name
FROM Person Person1, Friends, Person Person2
WHERE MATCH(Person1-(Friends)->Person2)
AND Person1.Name = 'John';
The MATCH clause takes a search pattern as input. This pattern traverses the graph from one node to another via an edge. Edges appear inside parentheses and nodes appear at the ends of the arrow. Please refer to the MATCH syntax guide to find out more ways in which MATCH can be used.
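As a further sketch, assuming the same Person and Friends tables with a Name column on Person, patterns can be chained across several edges and the arrow can point in either direction. The aliases used here are arbitrary.

```sql
-- Friends of friends of John: each hop needs its own alias
-- for the edge table and for the node table.
SELECT Person3.Name AS FriendOfFriend
FROM Person AS Person1, Friends AS f1, Person AS Person2,
     Friends AS f2, Person AS Person3
WHERE MATCH(Person1-(f1)->Person2-(f2)->Person3)
  AND Person1.Name = 'John';

-- People who list John as a friend: the arrow is reversed,
-- so the edge is followed from Person2 to Person1.
SELECT Person2.Name
FROM Person AS Person1, Friends, Person AS Person2
WHERE MATCH(Person1<-(Friends)-Person2)
  AND Person1.Name = 'John';
```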
Fully integrated in SQL Server engine
Graph extensions are fully integrated in the SQL Server engine. Node and edge tables are just new types of tables in the database. The same storage engine, metadata, query processor, etc., are used to store and query graph data. All security and compliance features are also supported. Other cutting-edge technologies like columnstore, ML using R Services, HA, and more can also be combined with graph capabilities to achieve more. Since graphs are fully integrated in the engine, users can query across their relational and graph data in a single system.
Tooling and ecosystem
Users benefit from the existing tools and ecosystem that SQL Server offers. Tools like backup and restore, import and export, BCP, and SSMS “just work” out of the box.
How can I ingest unstructured data?
Since we are storing data in tables, users must know the schema at the time of creation. Users can always add new types of nodes or edges to their schema. To modify an existing node or edge table, they can use ALTER TABLE to add or drop attributes. If you expect unknown attributes in your schema, you could either use sparse columns or create a column to hold JSON strings and use that as a placeholder for unknown attributes.
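The options above might be sketched as follows, assuming a Person node table with a Name column. The column names Email, Nickname, and ExtraAttributes are hypothetical. GO is the batch separator used by tools such as SSMS and sqlcmd; it is needed here because a newly added column cannot be referenced in the same batch that adds it.

```sql
-- Add and drop an ordinary attribute on an existing node table.
ALTER TABLE Person ADD Email NVARCHAR(256) NULL;
ALTER TABLE Person DROP COLUMN Email;

-- Option 1: a sparse column for an attribute that is rarely populated.
ALTER TABLE Person ADD Nickname NVARCHAR(100) SPARSE NULL;

-- Option 2: a single column holding a JSON string for unknown attributes.
ALTER TABLE Person ADD ExtraAttributes NVARCHAR(MAX) NULL;
GO

-- Store and read back a loosely structured attribute.
UPDATE Person
SET ExtraAttributes = N'{"hobby": "chess"}'
WHERE Name = 'John';

SELECT Name, JSON_VALUE(ExtraAttributes, '$.hobby') AS Hobby
FROM Person
WHERE ISJSON(ExtraAttributes) = 1;
```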
Do you maintain an adjacency list for faster lookups?
No. We are not maintaining an adjacency list on every node; instead, we are storing edge data in tables. Because SQL Server is a relational database, storing data in the form of tables was a more natural choice for us. In native directed graph databases with an adjacency list, you can only traverse in one direction. If you need to traverse in the reverse direction, you need to maintain an adjacency list at the remote node too. Also, with adjacency lists, a large query that spans a big graph is essentially always doing a nested loop lookup: for every node, find all the edges, from there find all the connected nodes and edges, and so on.
Storing edge data in a separate table allows us to benefit from the query optimizer, which can pick the optimal join strategy for large queries. Depending on the complexity of the query and the data statistics, the optimizer can pick a nested loop join, hash join, or other join strategies, as opposed to always using a nested loop join, as in the case of an adjacency list. Each edge table has two implicit columns, $from_id and $to_id, which store information about the nodes that it connects. For OLTP scenarios, we recommend that users create indexes on these columns ($from_id, $to_id) for faster lookups in the direction of the edge. If your application needs to perform traversals in the reverse direction of an edge, you can create an index on ($to_id, $from_id).
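For the Friends edge table used earlier, the two recommended indexes could be created along these lines. The index names are arbitrary.

```sql
-- Speeds up lookups that follow edges in their natural direction
-- (from the $from_id node to the $to_id node).
CREATE INDEX IX_Friends_From_To ON Friends ($from_id, $to_id);

-- Speeds up traversals in the reverse direction of the edge.
CREATE INDEX IX_Friends_To_From ON Friends ($to_id, $from_id);
```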
Is the new MATCH syntax supported on relational tables?
No. The MATCH clause works only on graph node and edge tables.
Can I alter an existing table into a node or edge table?
No. In the first release, ALTER TABLE to convert an existing relational table into a node or edge table is not supported. Users can create a node table and use INSERT INTO … SELECT FROM to populate data into the node table. To populate an edge table from an existing table, proper $from_id and $to_id values must be obtained from the node tables.
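A sketch of this migration is shown below. It assumes two hypothetical relational tables, dbo.LegacyPerson (PersonID, FullName) and dbo.LegacyFriendship (PersonID, FriendID, Since), and the Person node table and Friends edge table with ID, Name, and StartDate columns.

```sql
-- Step 1: copy entities from the existing relational table into the node table.
INSERT INTO Person (ID, Name)
SELECT PersonID, FullName
FROM dbo.LegacyPerson;

-- Step 2: populate the edge table. The $from_id and $to_id values are
-- looked up from the $node_id of the matching rows in the node table.
INSERT INTO Friends ($from_id, $to_id, StartDate)
SELECT p1.$node_id, p2.$node_id, lf.Since
FROM dbo.LegacyFriendship AS lf
JOIN Person AS p1 ON p1.ID = lf.PersonID
JOIN Person AS p2 ON p2.ID = lf.FriendID;
```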
What are some table operations that are not supported on node or edge tables?
In the first release, node or edge tables cannot be created as memory-optimized, system-versioned, or temporary tables. Stretching a node or edge table, or creating one as an external table (PolyBase), is also not supported in this release.
How do I find a node connected to me, an arbitrary number of hops away, in my graph?
The ability to recurse through a combination of nodes and edges, an arbitrary number of times, is called transitive closure. For example, find all the people connected to me through three levels of indirection, or find the employee chain for a given employee in an organization. Transitive closure is not supported in the first release. A recursive CTE or a T-SQL loop may be used to work around these types of queries.
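One possible shape for the recursive CTE workaround is sketched below. It assumes the Person and Friends tables from earlier, joins directly on the implicit $node_id, $from_id, and $to_id columns instead of using MATCH, and caps the recursion at three hops so that cycles in the graph cannot recurse indefinitely.

```sql
WITH Reachable AS (
    -- Anchor member: the starting node, zero hops away.
    SELECT p.$node_id AS NodeId, p.Name, 0 AS Hops
    FROM Person AS p
    WHERE p.Name = 'John'

    UNION ALL

    -- Recursive member: follow one more Friends edge from each reached node.
    SELECT p2.$node_id, p2.Name, r.Hops + 1
    FROM Reachable AS r
    JOIN Friends AS f  ON f.$from_id  = r.NodeId
    JOIN Person  AS p2 ON p2.$node_id = f.$to_id
    WHERE r.Hops < 3          -- depth limit, also guards against cycles
)
-- A person reachable along several paths is reported once,
-- with the smallest number of hops.
SELECT Name, MIN(Hops) AS Hops
FROM Reachable
WHERE Hops > 0
GROUP BY Name
ORDER BY Hops, Name;
```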
How do I find ANY Node connected to me in my graph?
The ability to find any type of node connected to a given node in a graph is called polymorphism. SQL graph does not support polymorphism in the first release. A possible workaround is to write queries with a UNION clause over a known set of node and edge types. However, this workaround is only practical for a small set of node and edge types.
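The UNION workaround might look like the following sketch. In addition to Person and Friends, it assumes a hypothetical Restaurant node table with a Name column and a hypothetical Likes edge table; one SELECT is needed per combination of edge type and node type.

```sql
-- Everything directly connected to John, one branch per known pattern.
SELECT 'Person' AS NodeType, p2.Name AS NodeName
FROM Person AS p1, Friends, Person AS p2
WHERE MATCH(p1-(Friends)->p2)
  AND p1.Name = 'John'

UNION ALL

SELECT 'Restaurant' AS NodeType, r.Name AS NodeName
FROM Person AS p1, Likes, Restaurant AS r
WHERE MATCH(p1-(Likes)->r)
  AND p1.Name = 'John';
```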
Are there special graph analytics functions introduced?
Some graph databases provide dedicated graph analytical functions like “shortest path” or “page rank.” SQL Graph does not provide any such functions in this release. Again, T-SQL loops and temp tables may be used to write a workaround for these scenarios.
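As one example of such a workaround, a breadth-first search with a T-SQL loop and a temp table can return the number of hops between two people. This is only a sketch: it assumes the Person and Friends tables from earlier with an integer ID column on Person, follows edges in their stored direction only, and treats Name as unique.

```sql
DECLARE @start  VARCHAR(100) = 'John';
DECLARE @target VARCHAR(100) = 'Alice';
DECLARE @hops INT = 0, @added INT = 1;

-- People reached so far and the hop count at which each was first reached.
CREATE TABLE #Visited (PersonID INT PRIMARY KEY, Hops INT NOT NULL);

INSERT INTO #Visited (PersonID, Hops)
SELECT ID, 0 FROM Person WHERE Name = @start;

-- Expand one level per iteration until the target is reached
-- or no new people are found.
WHILE @added > 0
  AND NOT EXISTS (SELECT 1
                  FROM #Visited AS v
                  JOIN Person AS p ON p.ID = v.PersonID
                  WHERE p.Name = @target)
BEGIN
    SET @hops += 1;

    INSERT INTO #Visited (PersonID, Hops)
    SELECT DISTINCT p2.ID, @hops
    FROM #Visited AS v
    JOIN Person  AS p1 ON p1.ID = v.PersonID
    JOIN Friends AS f  ON f.$from_id  = p1.$node_id
    JOIN Person  AS p2 ON p2.$node_id = f.$to_id
    WHERE v.Hops = @hops - 1                      -- only the current frontier
      AND NOT EXISTS (SELECT 1 FROM #Visited AS seen
                      WHERE seen.PersonID = p2.ID);

    SET @added = @@ROWCOUNT;
END;

-- Returns no rows if the target is unreachable from the start.
SELECT v.Hops AS ShortestPathLength
FROM #Visited AS v
JOIN Person AS p ON p.ID = v.PersonID
WHERE p.Name = @target;

DROP TABLE #Visited;
```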
Thank you for reading this post! We are excited to announce the first version of graph extensions to SQL Server. To learn more, see this article on Graph processing with SQL Server 2017. Stay tuned for more blog posts and updates on SQL graph database!
Try SQL Server 2017
Get started with the preview of SQL Server 2017 on macOS, Docker, Windows, and Linux using these links: