Parallel Graph Analytics (PGX) is a technology from Oracle Labs, and the latest version (PGX 1.2) has just been released. It contains both a DSL for writing graph algorithms that can be transparently parallelized, and, new in 1.2, an SQL-like language for pattern-matching subgraphs in a larger graph.

Like Neo4j or GraphX, it does graph analytics: you represent data as a graph and run analyses and pattern matching on it. However, it offers, in some cases, orders-of-magnitude better performance. It does that in several ways: the DSL you use to write analytics algorithms is compiled to highly parallel Java code; the declarative SQL-like pattern-matching language, PGQL, is similarly parallelized; and the runtime is highly optimized for memory footprint.

Graph analytics is really useful for all sorts of things. The obvious ones are things like your Facebook friends graph – find people you probably know because your friends know them, and that sort of thing. What’s less obvious is all the other things you can use it for. Some examples:

  • Let nodes represent people and insurance claims; look for patterns where the same people appear on both sides of several claims, filtered for geographical proximity. You’ve found an insurance fraud ring.
  • Let nodes represent Java methods that call other Java methods. Run PageRank or another centrality algorithm, and index how central – how important – each method is. When a Git commit modifies an important method, send an email to the team asking for review.
  • A recommendation engine – matrix factorization lets you take a graph of users and items they recommended, and synthesize “features” (latent categories) that let you predict other items a user will recommend highly or, in reverse, find users who will be interested in an item.

The point here is that, once you start doing graph analysis and get used to thinking in terms of graphs, you have this epiphany that there are all sorts of things you can learn. What graph analysis does is surface latent information that’s encoded in the structure of a set of relationships – and those relationships can be as easily parts and suppliers and products as they can be Facebook friends or other obvious things. The software industry has barely scratched the surface on the kinds of things graph analytics can be used for.

Nodes and edges in a graph have properties – key/value pairs – that can be used when computing an analysis. In PGX, running an analysis usually results in synthesizing new properties on components of the graph, and those can then be used in pattern-matching queries. So PGX really provides full-service graph analytics in a single package.
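
To make the "analysis writes a property, query reads it" idea concrete, here is a minimal sketch in plain Python. It does not use PGX at all; the toy graph, the property names and the helper functions are all hypothetical, chosen only for illustration.

```python
# Hypothetical toy graph: node properties are plain key/value dicts.
nodes = {
    "alice": {"city": "Boston"},
    "bob": {"city": "Boston"},
    "carol": {"city": "Prague"},
}
edges = [("alice", "bob"), ("carol", "bob"), ("alice", "carol")]


def add_in_degree(nodes, edges, prop="in_degree"):
    """Analysis step: write each node's in-degree into a new property."""
    for props in nodes.values():
        props[prop] = 0
    for _src, dst in edges:
        nodes[dst][prop] += 1


def select(nodes, predicate):
    """Query step: return the sorted names of nodes whose properties match."""
    return sorted(name for name, props in nodes.items() if predicate(props))


add_in_degree(nodes, edges)
# The query mixes an original property (city) with a synthesized one (in_degree).
popular_in_boston = select(
    nodes, lambda p: p["city"] == "Boston" and p["in_degree"] >= 2
)
print(popular_in_boston)
```

The point of the sketch is only the shape of the workflow: the analysis leaves its result on the graph, so a later query can filter on it like any other property.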

Graph analytics doesn’t work well in SQL databases. For a lot of typical graph questions you’d like to answer, you’d have to JOIN one table on itself n times, and you don’t actually know the value of n in advance, except that it could be as large as the row count of the table. That is something SQL databases don’t perform well at.
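
As an illustration of why the number of joins is unknown, here is a small Python sketch of a reachability question ("which parts does this part ultimately depend on?"). The adjacency-dict format and the sample data are assumptions made for the example. A breadth-first search just keeps going until it runs out of new nodes, whereas a fixed SQL query would need one self-join per hop.

```python
from collections import deque


def reachable(adjacency, start):
    """Return every other node reachable from start, however many hops away."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    seen.discard(start)
    return seen


# Hypothetical dependency chain: the depth is not known when the query is written.
depends_on = {
    "engine": ["piston", "valve"],
    "piston": ["ring"],
    "ring": ["steel"],
    "valve": ["steel"],
}
print(sorted(reachable(depends_on, "engine")))
```

Here "steel" is three hops from "engine" along one path and two along another; the traversal does not need to know either number up front.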

So, starting PGX in local mode is pretty simple, and it can handle shockingly large graphs on a laptop. You just download it, install it, and run bin/pgx to start the interactive Groovy shell (it also has a Java API and a REST API built in):

foo@bar ~/work/lib/pgx $ bin/pgx
PGX Shell 1.2.0-SNAPSHOT
type :help for available commands
02:11:11,961 [main] INFO Ctrl$2 - >>> PGX engine running.
variables instance, session and analyst ready to use

Loading a graph in a plain-text format such as edge list or GraphML is simple: you write a small JSON file that describes the schema of the graph and the format, then

pgx> graph = session.readGraphWithProperties('myGraph.json');

And you can immediately run any of the built-in algorithms on it:

pgx> analyst.countTriangles(graph, true);
==> 23
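
If you are wondering what a triangle count actually computes, here is a rough plain-Python sketch. It is not PGX's implementation and says nothing about what the second argument above does; it simply assumes an undirected simple graph given as a list of (u, v) pairs with comparable node IDs.

```python
from itertools import combinations


def count_triangles(edges):
    """Count triangles in an undirected simple graph given as (u, v) pairs."""
    neighbors = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    total = 0
    for node, nbrs in neighbors.items():
        # Only look at higher-numbered neighbors so that each triangle
        # is counted exactly once, at its smallest node.
        higher = sorted(n for n in nbrs if n > node)
        for a, b in combinations(higher, 2):
            if b in neighbors[a]:
                total += 1
    return total


# One triangle (1, 2, 3) plus a dangling edge to node 4.
print(count_triangles([(1, 2), (2, 3), (1, 3), (3, 4)]))
```

Triangle counts are a common building block for clustering measures: lots of triangles around a node means its neighbors tend to know each other.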

Custom analytics algorithms are written in a language called Green-Marl, which treats graph elements as first-class citizens, has common operations such as breadth-first search as (parallelizable) language constructs, and offers conveniences like initializing a vector with a default value (or random values) in a single line of code.

For example, here is the classic PageRank algorithm:

procedure pagerank(G: graph, e,d: double, max_iter_count: int;
                   pg_rank: nodeProp) {
    double diff;
    int cnt = 0;
    double N = G.numNodes();
    G.pg_rank = 1 / N;
    do {
        diff = 0.0;
        foreach (t: G.nodes) {
            double val = (1-d) / N + d* 
                sum(w: t.inNbrs) {w.pg_rank / w.outDegree()} ;
            diff += | val - t.pg_rank |;
            t.pg_rank <= val;
        }
        cnt++;
    } while ((diff > e) && (cnt < max_iter_count));
}

One thing to note is that, under the hood, all of the loops can be parallelized by the execution engine: this is not sequential code, though writing it feels that way. To load and run this, you would simply do this:

pgx> program = session.compileProgram("");
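
If Green-Marl is unfamiliar, the PageRank procedure above reads roughly like the following sequential Python. This is a sketch for intuition only, not what PGX generates. The dict-of-lists input format and the default parameter values are assumptions made for illustration, and it treats the Green-Marl `<=` as a deferred assignment, so every node in an iteration sees the ranks from the previous iteration. It also assumes the graph has at least one node.

```python
def pagerank(out_edges, e=1e-6, d=0.85, max_iter_count=100):
    """out_edges maps each node to the list of nodes it links to."""
    nodes = set(out_edges)
    for targets in out_edges.values():
        nodes.update(targets)
    n = len(nodes)

    # Precompute incoming neighbors and out-degrees once.
    in_nbrs = {node: [] for node in nodes}
    out_degree = {node: 0 for node in nodes}
    for src, targets in out_edges.items():
        for dst in targets:
            in_nbrs[dst].append(src)
            out_degree[src] += 1

    rank = {node: 1.0 / n for node in nodes}
    cnt = 0
    while True:
        diff = 0.0
        new_rank = {}
        for t in nodes:
            val = (1 - d) / n + d * sum(
                rank[w] / out_degree[w] for w in in_nbrs[t]
            )
            diff += abs(val - rank[t])
            new_rank[t] = val  # written to a copy: ranks update after the loop
        rank = new_rank
        cnt += 1
        if not (diff > e and cnt < max_iter_count):
            break
    return rank


# Hypothetical call graph: both b and c call a.
ranks = pagerank({"b": ["a"], "c": ["a"]})
print(max(ranks, key=ranks.get))
```

Because each node's new value depends only on the previous iteration's ranks, the inner loop has no ordering constraints, which is what makes it safe for an engine to run it in parallel.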

Pattern matching, on the other hand, uses a declarative SQL-like language called PGQL, which allows matching on node and edge properties and has features similar to those of SQL. For example, say you believe the proverb “the enemy of my enemy is my friend”, and you have a graph of who is feuding with whom. This query will find, given an input node, the list of the enemies of their enemies:

pgx> resultSet = G.queryPgql("SELECT, WHERE x -[e1 WITH label = 'feuds']-> y, y -[e2 WITH label = 'feuds']-> z");
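
To show what the two-hop pattern is matching, here is the same idea hand-rolled in plain Python. It is a sketch, not a description of how PGQL evaluates queries; the (source, label, target) edge triples and the sample names are hypothetical.

```python
def enemies_of_enemies(edges, x):
    """Return every z with x -feuds-> y -feuds-> z, for a given start x."""
    feuds = {}
    for src, label, dst in edges:
        if label == "feuds":  # only 'feuds' edges take part in the pattern
            feuds.setdefault(src, set()).add(dst)

    result = set()
    for y in feuds.get(x, ()):
        result.update(feuds.get(y, ()))
    return result


# Hypothetical labeled edge list.
edges = [
    ("montague", "feuds", "capulet"),
    ("capulet", "feuds", "tybalt"),
    ("capulet", "likes", "paris"),
]
print(enemies_of_enemies(edges, "montague"))
```

Note that if a feud is recorded in both directions, this sketch will happily report x as an enemy of its own enemy; whether you want that is a modeling decision. The declarative version states the pattern once and leaves the traversal strategy, and its parallelization, to the engine.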

Anyway, this is too short an article to describe all of the things you can do with PGX, but hopefully this has inspired you to look deeper. You can download the PGX technology preview from Oracle Labs here. PGX is also incorporated into Oracle Big Data Spatial and Graph for commercial use.

Graph Analytics from Oracle Labs with PGX 1.2
