Open-source triplestore battle
- Pieter Botha
- October 2, 2020
Image by Pieter Botha
There are many graph databases out there that support RDF: Virtuoso, GraphDB, Stardog, AnzoGraph, and RDFox, to name just a few popular ones. But if the requirements for your triplestore include open source, as they do for our CFI-funded LINCS project, then Blazegraph and Apache’s Jena Fuseki are two of your most mature options. This article compares Blazegraph and Jena Fuseki, two contenders for the LINCS graph database. Thanks to Angus Addlesee for writing an article that compared Blazegraph with commercial triplestores and inspired the testing methodology for this post.
Blazegraph
Blazegraph, previously known as Bigdata, is a great triplestore that scales to billions of triples with thousands of proven use cases. In fact, it was so good that AWS bought the Blazegraph trademark almost five years ago and hired some of its staff, including the CEO. Unfortunately, that meant that most of Blazegraph’s development expertise went into creating a competing product: Amazon Neptune. Although official releases of Blazegraph have slowed down, it still supports SPARQL 1.1 and is by no means outdated.
Fuseki
Apache’s Fuseki, along with the entire Jena project and all its plugins, is still actively developed as of October 2020. It supports SPARQL 1.1 Update and gets new features and enhancements with each release, which arrives every quarter or so. We know that Fuseki can scale, as shown here loading the entire Wikidata dump. But what is query performance like, and how does it compare to Blazegraph? Let’s find out!
The setup
Trying to have a fair competition in a matchup like this is very difficult. Different products almost always have different strengths and selective benchmarking can easily skew results. Getting one-sided results was not the intention here, but I did choose a small set of tests, as an exhaustive test suite would require a book and not an article. My testing involved loading this Olympic sports dataset with ~1.8m triples and then executing some timed SPARQL queries using the built-in web interface of both triplestores.
The Blazegraph instance is based on a September 2016 build from the 2.2.0 branch as per the Dockerfile here. This image has full-text search enabled as well as a geo index.
I used this docker file from the LINCS project to create a Fuseki instance based on the latest v3.16 release. It is a basic TDB2 configuration with a full-text index for all rdfs:label properties.
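The actual config.ttl is linked above; the core of such a setup, a TDB2 dataset wrapped in a Lucene text index that maps rdfs:label into an indexed field, looks roughly like the following sketch (the dataset location and index directory are hypothetical paths, not the ones LINCS uses):

```turtle
@prefix :       <#> .
@prefix tdb2:   <http://jena.apache.org/2016/tdb#> .
@prefix text:   <http://jena.apache.org/text#> .
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .

# Text-indexed dataset: queries against :text_dataset can use text:query.
:text_dataset a text:TextDataset ;
    text:dataset :tdb_dataset ;
    text:index   :lucene_index .

# The underlying TDB2 store (location is a placeholder).
:tdb_dataset a tdb2:DatasetTDB2 ;
    tdb2:location "/fuseki/databases/olympics" .

# A Lucene index over rdfs:label, exposed as the default search field.
:lucene_index a text:TextIndexLucene ;
    text:directory <file:/fuseki/lucene> ;
    text:entityMap :entity_map .

:entity_map a text:EntityMap ;
    text:entityField  "uri" ;
    text:defaultField "label" ;
    text:map ( [ text:field "label" ; text:predicate rdfs:label ] ) .
```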
The tests were executed on an 8th-generation Core i5 with SSDs and plenty of RAM. Neither triplestore was “warmed up”, and queries were executed in the same order and the same number of times in an effort to keep the playing field as level as possible.
The tests
The SPARQL queries used these prefixes:
PREFIX walls: <http://wallscope.co.uk/ontology/olympics/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
Queries were executed twice and both results were recorded.
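The timings below came from each store’s built-in web interface, but the same measurement can be scripted. A minimal harness for a repeatable timed query over HTTP might look like this sketch (the endpoint URL in the example is an assumption for a default local Fuseki install):

```python
import time
import urllib.parse
import urllib.request

def time_query(endpoint: str, query: str) -> float:
    """POST a SPARQL query to an HTTP endpoint and return elapsed seconds."""
    data = urllib.parse.urlencode({"query": query}).encode()
    req = urllib.request.Request(
        endpoint,
        data=data,
        headers={"Accept": "application/sparql-results+json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        resp.read()  # drain the response so transfer time is included
    return time.perf_counter() - start

# Example (assumes a local Fuseki dataset named "olympics"):
# elapsed = time_query("http://localhost:3030/olympics/sparql",
#                      "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }")
```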
Loading data
It is pretty important to most projects to know how long it will take to load data into the triplestore. Since our dataset is relatively small (< 2 million triples), I was able to use the web interface of both triplestores to load the Turtle file without any issues.
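Loading can also be done over HTTP instead of the web interface. A sketch of a generic loader, with endpoint paths that are assumptions for default local installs (Fuseki’s Graph Store Protocol endpoint and Blazegraph’s REST API, respectively):

```python
import urllib.request

def load_turtle(endpoint: str, path: str) -> int:
    """POST a Turtle file to a triplestore's HTTP endpoint; return the status code."""
    with open(path, "rb") as f:
        body = f.read()
    req = urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "text/turtle"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

# Assumed default endpoints:
#   Fuseki (Graph Store Protocol): http://localhost:3030/olympics/data?default
#   Blazegraph (REST API):         http://localhost:9999/blazegraph/sparql
```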
Fuseki: 57s, 30.3s
Blazegraph: 57s, 21.5s
The second run was an update with the same .ttl file used in the first run. No actual changes were made to the graph.
Counting triples
The simplest of queries: just counting how many triples are in the dataset.
SELECT (COUNT(*) AS ?triples)
WHERE { ?s ?p ?o . }
Fuseki: 0.8s, 0.5s
Blazegraph: 0.02s, 0.01s
It looks like Blazegraph did some pre-aggregation here while loading the data.
Regex filter
SELECT DISTINCT ?name ?cityName ?seasonName
WHERE {
  ?instance walls:games ?games ;
            walls:athlete ?athlete .
  ?games dbp:location ?city ;
         walls:season ?season .
  ?city rdfs:label ?cityName .
  ?season rdfs:label ?seasonName .
  ?athlete rdfs:label ?name .
  FILTER (REGEX(LCASE(?name), "louis.*"))
}
Fuseki: 7.7s, 5.0s
Blazegraph: 7.0s, 4.2s
Blazegraph was consistently faster, pulling further ahead on this typical query.
Full-text searching
Using the full-text index efficiently required slightly different queries because Fuseki performed very slowly unless the full-text search was the first filter.
PREFIX text: <http://jena.apache.org/text#>
SELECT DISTINCT ?name ?cityName ?seasonName
WHERE {
  ?athlete text:query ('louis*') ;
           rdfs:label ?name .
  ?instance walls:games ?games ;
            walls:athlete ?athlete .
  ?games dbp:location ?city ;
         walls:season ?season .
  ?city rdfs:label ?cityName .
  ?season rdfs:label ?seasonName .
}
PREFIX bds: <http://www.bigdata.com/rdf/search#>
SELECT DISTINCT ?name ?cityName ?seasonName
WHERE {
  ?instance walls:games ?games ;
            walls:athlete ?athlete .
  ?games dbp:location ?city ;
         walls:season ?season .
  ?city rdfs:label ?cityName .
  ?season rdfs:label ?seasonName .
  ?athlete rdfs:label ?name .
  ?name bds:search "louis*" .
}
Fuseki: 0.2s, 0.1s
Blazegraph: 0.3s, 0.1s
Moving the full-text filter to the top of the Blazegraph query as well made it faster than Fuseki: 0.08s for the first run and 0.04s for the second.
Complex join
PREFIX noc: <http://wallscope.co.uk/resource/olympics/NOC/>
SELECT ?genderName (COUNT(?athlete) AS ?count)
WHERE {
  ?instance walls:games ?games ;
            walls:athlete ?athlete .
  ?games dbp:location ?city .
  ?athlete foaf:gender ?gender .
  ?gender rdfs:label ?genderName .
  {
    SELECT DISTINCT ?city
    WHERE {
      ?instance walls:games ?games ;
                walls:athlete ?athlete .
      ?athlete dbo:team ?team .
      noc:SCG dbo:ground ?team .
      ?games dbp:location ?city .
    }
  }
}
GROUP BY ?genderName
Fuseki: DNF
Blazegraph: 7.0s, 6.0s
Fuseki did not manage to finish this query before the configured timeout of 10 minutes.
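For reference, such a timeout can be set in Fuseki via Jena’s arq:queryTimeout context symbol. In an assembler file it looks roughly like this sketch (the service name is a placeholder for whatever config.ttl defines; the value is in milliseconds, so 600000 ms is 10 minutes):

```turtle
@prefix ja:     <http://jena.apache.org/2005/11/Assembler#> .
@prefix fuseki: <http://jena.apache.org/fuseki#> .

# Attach a 10-minute query timeout to a Fuseki service.
<#service> a fuseki:Service ;
    ja:context [ ja:cxtName  "arq:queryTimeout" ;
                 ja:cxtValue "600000" ] .
```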
Federated query
This query joins local results with a graph queried over the internet from dbpedia.org.
SELECT ?sport ?sportName ?teamSize
WHERE {
  {
    SELECT DISTINCT ?sportName
    WHERE {
      ?sport rdf:type dbo:Sport ;
             rdfs:label ?sportName .
    }
  }
  SERVICE <http://dbpedia.org/sparql> {
    ?sport rdfs:label ?sportName ;
           dbo:teamSize ?teamSize .
  }
}
ORDER BY DESC(?teamSize)
Fuseki: 7.9s, 7.5s
Blazegraph: 0.5s, 0.4s
Summary
| Test       | Data load | Triple count | Regex | Full-text | Complex | Federated |
|------------|-----------|--------------|-------|-----------|---------|-----------|
| Fuseki     | 57s       | 0.8s         | 7.7s  | 0.2s      | DNF     | 7.9s      |
| Blazegraph | 57s       | 0.02s        | 7.0s  | 0.1s      | 7.0s    | 0.5s      |
I must admit that I was somewhat surprised by the results. Blazegraph performed consistently better than Fuseki in this scenario. Fuseki’s failure to finish the complex query could possibly come down to an indexing problem; be that as it may, Blazegraph ran that same query just fine straight out of the box. Blazegraph also beat Fuseki by more than an order of magnitude on the federated query.
One possible explanation for these one-sided results is that Blazegraph’s indexes are better configured for this dataset and I need to apply more effort to get Fuseki’s indexes optimized. Please feel free to look at the config.ttl I used to configure Fuseki and let me know in the comments if I missed an obvious optimization or if I misconfigured something.
Fuseki configuration followup
Although I tried to configure Fuseki with the simplest full-text index possible, I feared that a misconfiguration might have been the cause of the comparatively disappointing performance. To rule out that possibility, and for the sake of completeness, I ran the benchmarks against default Fuseki databases created from the admin portal without any customization.
In-memory store
- data load: 25s, 27s
- counting: 2.8s, 1.6s
- regex: 2.8s, 2.2s
- full-text: not enabled
- complex: DNF
- federated: 8.7s, 7.4s

TDB store
- data load: 43s, 30s
- counting: 0.9s, 0.7s
- regex: 4.5s, 3.3s
- full-text: not enabled
- complex: DNF
- federated: 8.5s, 8.3s

TDB2 store
- data load: 54s, 28s
- counting: 0.7s, 0.6s
- regex: 7.8s, 5.7s
- full-text: not enabled
- complex: DNF
- federated: 7.7s, 7.7s
The results were more or less aligned with the original performance figures. So, it seems that vanilla Fuseki is just considerably slower than Blazegraph for this dataset and queries.