
Open-source triplestore battle

  • Pieter Botha
  • October 2, 2020

Image by Pieter Botha

There are many graph databases out there that support RDF: Virtuoso, GraphDB, Stardog, AnzoGraph, and RDFox, to name just a few popular ones. But if your triplestore must be open source, as it must be for our CFI-funded LINCS project, then Blazegraph and Apache’s Jena Fuseki are two of your most mature options. This article compares the two as contenders for the LINCS graph database. Thanks to Angus Addlesee for writing an article that compared Blazegraph with commercial triplestores and inspired the testing methodology for this post.

Blazegraph

Blazegraph, previously known as Bigdata, is a great triplestore that scales to billions of triples with thousands of proven use cases. In fact, it was so good that AWS bought the Blazegraph trademark almost five years ago and hired some of its staff, including the CEO. Unfortunately, that meant that most of Blazegraph’s development experience was used to create a competing product: Amazon Neptune. Although the official releases of Blazegraph have slowed down, it still supports SPARQL 1.1 and is by no means outdated.

Fuseki

Apache’s Fuseki, along with the entire Jena project and all its plugins, is still actively developed as of October 2020. It supports SPARQL 1.1 and gets new features and enhancements with each release, which arrives every quarter or so. We know that Fuseki can scale, as shown here loading the entire Wikidata dump. But what is its query performance like, and how does it compare to Blazegraph? Let’s find out!

The setup

Trying to have a fair competition in a matchup like this is very difficult. Different products almost always have different strengths and selective benchmarking can easily skew results. Getting one-sided results was not the intention here, but I did choose a small set of tests, as an exhaustive test suite would require a book and not an article. My testing involved loading this Olympic sports dataset with ~1.8m triples and then executing some timed SPARQL queries using the built-in web interface of both triplestores.

The Blazegraph instance is based on a September 2016 build from the 2.2.0 branch as per the Dockerfile here. This image has full-text search enabled as well as a geo index.

I used this Dockerfile from the LINCS project to create a Fuseki instance based on the latest v3.16 release. It is a basic TDB2 configuration with a full-text index for all rdfs:label properties.

The tests were executed on an 8-series Core-i5 with SSDs and plenty of RAM. Neither triplestore was “warmed up” and queries were executed in the same order and the same number of times in an effort to keep the playing field as level as possible.

The tests

The SPARQL queries used these prefixes:

PREFIX walls: <http://wallscope.co.uk/ontology/olympics/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

Each query was executed twice and both timings were recorded. Results are listed as first run, second run.
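The timings in this post were read from the built-in web interfaces. For readers who would rather script the same two-run measurement, here is a minimal Python sketch that sends a query using the standard SPARQL 1.1 Protocol and times each run. The endpoint URLs, ports, and the dataset name `olympics` are assumptions for illustration and do not come from this post; adjust them to your own setup.

```python
import json
import time
import urllib.parse
import urllib.request

# Assumed endpoint URLs (not from the post): change the host, port, and
# dataset or namespace path to match your own installation.
ENDPOINTS = {
    "Fuseki": "http://localhost:3030/olympics/sparql",
    "Blazegraph": "http://localhost:9999/blazegraph/sparql",
}

COUNT_QUERY = "SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o . }"


def build_query_request(endpoint, query):
    """Build a SPARQL Protocol request: a URL-encoded POST asking for JSON results."""
    body = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,  # supplying data makes urllib send a form-encoded POST
        headers={"Accept": "application/sparql-results+json"},
    )


def run_query(endpoint, query, timeout=600):
    """Send the query and return the parsed JSON result document."""
    request = build_query_request(endpoint, query)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def time_runs(action, runs=2):
    """Call action() `runs` times and return the wall-clock seconds of each call."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        action()
        timings.append(time.perf_counter() - start)
    return timings
```

Calling `time_runs(lambda: run_query(ENDPOINTS["Fuseki"], COUNT_QUERY))` returns a list holding the first-run and second-run times in seconds, the same order used for the results in this post. Times measured this way include HTTP and JSON parsing overhead, so they will not exactly match what the web interfaces report.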

Loading data

It is pretty important to most projects to know how long it will take to load data into the triplestore. Since our dataset is relatively small (< 2 million triples), I was able to use the web interface of both triplestores to load the Turtle file without any issues.

Fuseki: 57s, 30.3s
Blazegraph: 57s, 21.5s

The second run was an update with the same .ttl file used in the first run. No actual changes were made to the graph.
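The loads above were done through the web interfaces. A scripted load is also possible over HTTP; the sketch below uses the W3C SPARQL 1.1 Graph Store HTTP Protocol, where a POST of Turtle data merges it into the target graph. The URL, including the dataset name `olympics`, is hypothetical and must be adapted to each triplestore, whose own documentation should be checked for the exact upload endpoint.

```python
import time
import urllib.request

# Hypothetical URL (not from the post): a dataset named "olympics" that exposes
# the Graph Store Protocol for its default graph.
GRAPH_STORE_URL = "http://localhost:3030/olympics/data?default"


def build_load_request(graph_store_url, turtle_bytes):
    """Build a POST request that merges Turtle data into the target graph."""
    return urllib.request.Request(
        graph_store_url,
        data=turtle_bytes,
        headers={"Content-Type": "text/turtle; charset=utf-8"},
        method="POST",
    )


def timed_load(graph_store_url, turtle_path, timeout=600):
    """Upload a Turtle file and return the elapsed wall-clock time in seconds."""
    with open(turtle_path, "rb") as handle:
        request = build_load_request(graph_store_url, handle.read())
    start = time.perf_counter()
    with urllib.request.urlopen(request, timeout=timeout) as response:
        response.read()  # wait until the server has finished responding
    return time.perf_counter() - start
```

Because an RDF graph is a set of triples, calling `timed_load` a second time with the same file leaves the graph unchanged, which corresponds to the second run described above.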

Counting triples

This is the simplest of queries: it just counts how many triples are in the dataset.

SELECT (COUNT(*) AS ?triples)
WHERE {
  ?s ?p ?o .
}
Fuseki: 0.8s, 0.5s
Blazegraph: 0.02s, 0.01s

It looks like Blazegraph did some pre-aggregation here while loading the data.

Regex filter

SELECT DISTINCT ?name ?cityName ?seasonName
WHERE {
  ?instance walls:games ?games ;
    walls:athlete ?athlete .
  ?games dbp:location ?city ;
    walls:season ?season .
  ?city rdfs:label ?cityName .
  ?season rdfs:label ?seasonName .
  ?athlete rdfs:label ?name .
  Filter (REGEX(lcase(?name),"louis.*"))
}
Fuseki: 7.7s, 5.0s
Blazegraph: 7.0s, 4.2s

Blazegraph was faster on both runs of this typical query, though the margin was small.

Full-text searching

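Using the full-text index efficiently required slightly different queries for the two triplestores, because Fuseki performed very slowly unless the full-text search was the first pattern in the query. The Fuseki query is shown first, followed by the Blazegraph query.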

PREFIX text: <http://jena.apache.org/text#>
SELECT DISTINCT ?name ?cityName ?seasonName
WHERE {
  ?athlete text:query ('louis*') ;
    rdfs:label ?name .
  ?instance walls:games ?games .
  ?games dbp:location ?city ;
    walls:season ?season .
  ?city rdfs:label ?cityName .
  ?season rdfs:label ?seasonName .
}
PREFIX bds: <http://www.bigdata.com/rdf/search#>
SELECT DISTINCT ?name ?cityName ?seasonName
WHERE {
  ?instance walls:games ?games ;
    walls:athlete ?athlete .
  ?games dbp:location ?city ;
    walls:season ?season .
  ?city rdfs:label ?cityName .
  ?season rdfs:label ?seasonName .
  ?athlete rdfs:label ?name .
  ?name bds:search "'louis*'" .
}
Fuseki: 0.2s, 0.1s
Blazegraph: 0.3s, 0.1s

Moving the full-text filter to the top of the Blazegraph query as well made it faster than Fuseki: 0.08s for the first run and 0.04s for the second run.
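The reordered Blazegraph query is not shown in this post. As a rough sketch of what the change amounts to, the Python helper below assembles the Blazegraph full-text query with the `bds:search` pattern placed either first or last, so both variants can be timed. The helper name and the exact pattern order in the reordered variant are assumptions; only the patterns themselves come from the query above.

```python
PREFIXES = """\
PREFIX walls: <http://wallscope.co.uk/ontology/olympics/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbp: <http://dbpedia.org/property/>
PREFIX bds: <http://www.bigdata.com/rdf/search#>
"""

# The full-text search pattern from the Blazegraph query in this post.
SEARCH_PATTERN = "  ?name bds:search \"'louis*'\" ."

# The remaining triple patterns, unchanged from the Blazegraph query.
JOIN_PATTERNS = """\
  ?instance walls:games ?games ;
    walls:athlete ?athlete .
  ?games dbp:location ?city ;
    walls:season ?season .
  ?city rdfs:label ?cityName .
  ?season rdfs:label ?seasonName .
  ?athlete rdfs:label ?name ."""


def blazegraph_fulltext_query(search_first):
    """Return the query text with the search pattern first (True) or last (False)."""
    if search_first:
        parts = [SEARCH_PATTERN, JOIN_PATTERNS]
    else:
        parts = [JOIN_PATTERNS, SEARCH_PATTERN]
    return (
        PREFIXES
        + "SELECT DISTINCT ?name ?cityName ?seasonName\n"
        + "WHERE {\n"
        + "\n".join(parts)
        + "\n}"
    )
```

`blazegraph_fulltext_query(search_first=False)` reproduces the query as listed above, while `search_first=True` gives the assumed reordered variant. Both contain exactly the same patterns, so any timing difference comes from the order alone.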

Complex join

PREFIX noc: <http://wallscope.co.uk/resource/olympics/NOC/>
SELECT ?genderName (COUNT(?athlete) AS ?count)
WHERE {
  ?instance walls:games ?games ;
    walls:athlete ?athlete .
  ?games dbp:location ?city .
  ?athlete foaf:gender ?gender .
  ?gender rdfs:label ?genderName .
  {
    SELECT DISTINCT ?city
    WHERE {
      ?instance walls:games ?games ;
        walls:athlete ?athlete .
      ?athlete dbo:team ?team .
      noc:SCG dbo:ground ?team .
      ?games dbp:location ?city .
    }
  }
}
GROUP BY ?genderName
Fuseki: DNF
Blazegraph: 7.0s, 6.0s

Fuseki did not manage to finish this query before the configured timeout of 10 minutes.

Federated query

This query joins the local data with a remote graph that is queried over the internet from dbpedia.org.

SELECT ?sport ?sportName ?teamSize
WHERE {
  {
    SELECT DISTINCT ?sportName 
    WHERE {
      ?sport rdf:type dbo:Sport ;
        rdfs:label ?sportName .
    }
  }
  SERVICE <http://dbpedia.org/sparql> 
  {
    ?sport rdfs:label ?sportName ;
      dbo:teamSize ?teamSize .
  }
}
ORDER BY DESC (?teamSize)
Fuseki: 7.9s, 7.5s
Blazegraph: 0.5s, 0.4s

Summary

| Test | Data load | Triple Count | Regex | Full-text | Complex | Federated |
|---|---|---|---|---|---|---|
| Fuseki | 57s | 0.8s | 7.7s | 0.2s | DNF | 7.9s |
| Blazegraph | 57s | 0.02s | 7.0s | 0.1s | 7.0s | 0.5s |

I must admit that I was somewhat surprised by the results. Blazegraph performed consistently better than Fuseki in this scenario. Fuseki’s failure to finish the complex query could possibly be caused by an indexing problem. Be that as it may, Blazegraph ran that same query just fine straight out of the box. Blazegraph also beat Fuseki by more than an order of magnitude with the federated query.
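To put numbers on these comparisons, the short sketch below divides the Fuseki time by the Blazegraph time for each test. The figures are copied from the summary in this post; the function names are only illustrative.

```python
# Times in seconds from the summary: (Fuseki, Blazegraph).
# None marks the query that Fuseki did not finish (DNF).
SUMMARY = {
    "Data load": (57.0, 57.0),
    "Triple Count": (0.8, 0.02),
    "Regex": (7.7, 7.0),
    "Full-text": (0.2, 0.1),
    "Complex": (None, 7.0),
    "Federated": (7.9, 0.5),
}


def speedup(fuseki_seconds, blazegraph_seconds):
    """Return how many times faster Blazegraph was, or None if Fuseki did not finish."""
    if fuseki_seconds is None:
        return None
    return fuseki_seconds / blazegraph_seconds


def format_report(summary):
    """Return one line per test with the Fuseki-to-Blazegraph time ratio."""
    lines = []
    for test, (fuseki, blazegraph) in summary.items():
        ratio = speedup(fuseki, blazegraph)
        text = "n/a (Fuseki DNF)" if ratio is None else f"{ratio:.1f}x"
        lines.append(f"{test}: {text}")
    return lines
```

Printing the lines from `format_report(SUMMARY)` gives 15.8x for the federated query, which is the "more than an order of magnitude" gap, and 40.0x for the triple count. The regex query comes out at only 1.1x and the data load at 1.0x.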

One possible explanation for these one-sided results is that Blazegraph’s indexes are better configured for this dataset and I need to apply more effort to get Fuseki’s indexes optimized. Please feel free to look at the config.ttl I used to configure Fuseki and let me know in the comments if I missed an obvious optimization or if I misconfigured something.

Fuseki configuration followup

Although I tried to configure Fuseki with the simplest full-text index possible, I feared that a misconfiguration might have been the cause of the comparatively disappointing performance. To rule out that possibility, and for the sake of completeness, I ran the benchmarks against default Fuseki databases created from the admin portal without any customization.

In-memory store
data load: 25s, 27s
counting: 2.8s, 1.6s
regex: 2.8s, 2.2s
full-text: not enabled
complex: DNF
federated: 8.7s, 7.4s

TDB store
data load: 43s, 30s
counting: 0.9s, 0.7s
regex: 4.5s, 3.3s
full-text: not enabled
complex: DNF
federated: 8.5s, 8.3s

TDB2 store
data load: 54s, 28s
counting: 0.7s, 0.6s
regex: 7.8s, 5.7s
full-text: not enabled
complex: DNF
federated: 7.7s, 7.7s

The results were more or less aligned with the original performance figures. So, it seems that vanilla Fuseki is just considerably slower than Blazegraph for this dataset and these queries.

 

