Dylan Storey

Recovering academic, hacker, tinkerer, scientist.

The Panama Papers - Building our Database

The publication of the Panama Papers highlighted how individuals/corporations could use offshore entities to avoid taxation at home. It also happened to use Neo4J , so it provides a great data set for general exploration but also gave me an excuse to get better with another database!

Jupyter notebook available here

Retrieving Data:

wget https://cloudfront-files-1.publicintegrity.org/offshoreleaks/data-csv.zip
unzip data-csv.zip

Basic Data Cleaning :

There were some odd escaped quotes in the file that broke the import , thankfully sed fixes it easily enough.

sed -i 's/\\"/"/' Addresses.csv

Loading our Data :

Connecting etc.

from py2neo import Graph

#don't want to time out
from py2neo.packages.httpstream import http
http.socket_timeout = 9999

#Connect and get a cypher object
graph = Graph('http://neo4j:neo4j@localhost:7474/db/data')
cypher = graph.cypher

#Clean graph 
#Assert nodes must be unique
cypher.execute("match (n) detach delete n")
cypher.execute("create constraint on (n:Node) assert n.node_id is unique")

Loading our Nodes

#Load Nodes
cypher.execute("""USING PERIODIC COMMIT 5000
LOAD CSV WITH HEADERS FROM "file:///home/dstorey/Notebooks/Panama-Papers/offshore_leaks_csvs-20160621/Addresses.csv" AS row MERGE (n:Node {node_id:row.node_id}) ON CREATE SET n = row, n:Address;
""")

cypher.execute("""USING PERIODIC COMMIT 5000
LOAD CSV WITH HEADERS FROM "file:///home/dstorey/Notebooks/Panama-Papers/offshore_leaks_csvs-20160621/Intermediaries.csv" AS row MERGE (n:Node {node_id:row.node_id})  ON CREATE SET n = row, n:Intermediary;
""")

cypher.execute("""USING PERIODIC COMMIT 5000
LOAD CSV WITH HEADERS FROM "file:///home/dstorey/Notebooks/Panama-Papers/offshore_leaks_csvs-20160621/Entities.csv" AS row MERGE (n:Node {node_id:row.node_id})  ON CREATE SET n = row, n:Entity;
""")

cypher.execute("""USING PERIODIC COMMIT 5000
LOAD CSV WITH HEADERS FROM "file:///home/dstorey/Notebooks/Panama-Papers/offshore_leaks_csvs-20160621/Officers.csv" AS row MERGE (n:Node {node_id:row.node_id})  ON CREATE SET n = row, n:Officer;
""")

Building our Relationships

The all_edges.csv file there are 85 unique connections between any two nodes. Some of these edges are redundant and other overlap. (i.e. “Beneficial Owner of” and “Beneficial owner of” are both the exact same ; while “Director / Beneficial Owner of” indicates that a connection has many potential interpretations).

In the end I decided to make the following connections “available” to the graph:

  • “POWER_OF_ATORNEY_OF”
  • “SIMILAR_NAME_ADDRESS”
  • “OWNER_OF”
  • “DIRECTOR_OF”
  • “BENIFICIARY_OF”
  • “SIGNATORY_OF”
  • “SHAREHOLDER_OF”
  • “APPOINTOR_OF”
  • “AUDITOR_OF”
  • “CONNECTED_OF”
  • “CORRESPONDANCE_ADDRESS_OF”
  • “CUSTODIAN_OF”
  • “GRANTEE_OF_MORTGAGE_OF”
  • “INTERMEDIARY_OF”
  • “LEGAL_COUNSEL_OF”
  • “TRUSTEE_OF_TRUST
  • “OFFICER_OR_STAFF_OF”
  • “REGISTERED_ADDRESS”
  • “RELATED_TO”
  • “RELATION_OF”
tuple_list = [("Power of attorney of" , "POWER_OF_ATORNEY_OF"),
("Power of Attorney of" , "POWER_OF_ATORNEY_OF"),
("Power of Attorney / Shareholder of" , "POWER_OF_ATORNEY_OF"),
("similar name and address as" , "SIMILAR_NAME_ADDRESS"),
("same address as" , "SIMILAR_NAME_ADDRESS"),
("same name and registration date as" , "SIMILAR_NAME_ADDRESS"),
("Beneficial owner of","OWNER_OF"),
("Beneficial Owner of","OWNER_OF"),
("Owner of","OWNER_OF"),
("Owner, director and shareholder of","OWNER_OF"),
("Alternate Director of", "DIRECTOR_OF"),
("Director and shareholder of"  , "DIRECTOR_OF"),
("Director / Beneficial Owner of" , "DIRECTOR_OF"),
("Director of", "DIRECTOR_OF"),
("Director (Rami Makhlouf) of", "DIRECTOR_OF"),
("Director / Shareholder / Beneficial Owner of", "DIRECTOR_OF"),
("Director / Shareholder of", "DIRECTOR_OF"),
("Beneficiary, shareholder and director of", "DIRECTOR_OF"),
("Nominee Director of", "DIRECTOR_OF"),
("Owner, director and shareholder of", "DIRECTOR_OF"),
("President and director of", "DIRECTOR_OF"),
("President - Director of", "DIRECTOR_OF"),
("Register of Director of", "DIRECTOR_OF"),
("Reserve Director of", "DIRECTOR_OF"),
("Resident Director of", "DIRECTOR_OF"),
("Beneficial owner of" , "BENIFICIARY_OF"),
("Beneficial Owner of" , "BENIFICIARY_OF"),
("Beneficiary, shareholder and director of" , "BENIFICIARY_OF"),
("beneficiary of" , "BENIFICIARY_OF"),
("Beneficiary of" , "BENIFICIARY_OF"),
("First beneficiary of" , "BENIFICIARY_OF"),
("Nominee Beneficial Owner of" , "BENIFICIARY_OF"),
("Nominee Beneficiary of" , "BENIFICIARY_OF"),
("Principal beneficiary of" , "BENIFICIARY_OF"),
("Sole signatory / Beneficial owner of" , "BENIFICIARY_OF"),
("Director / Beneficial Owner of" , "BENIFICIARY_OF"),
("Director / Shareholder / Beneficial Owner of" , "BENIFICIARY_OF"),
("Nominee Beneficial Owner of" , "BENIFICIARY_OF"),
("Authorised Person / Signatory of" , "SIGNATORY_OF"),
("Authorized signatory of" , "SIGNATORY_OF"),
("Auth. Representative of" , "SIGNATORY_OF"),
("Bank Signatory of" , "SIGNATORY_OF"),
("Signatory of" , "SIGNATORY_OF"),
("Sole signatory of" , "SIGNATORY_OF"),
("Beneficiary, shareholder and director of" , "SHAREHOLDER_OF"),
("Director and shareholder of" , "SHAREHOLDER_OF"),
("Director / Shareholder / Beneficial Owner of" , "SHAREHOLDER_OF"),
("Director / Shareholder of", "SHAREHOLDER_OF"),
("Member / Shareholder of" , "SHAREHOLDER_OF"),
("Nominee Shareholder of" , "SHAREHOLDER_OF"),
("Owner, director and shareholder of" , "SHAREHOLDER_OF"),
("Power of Attorney / Shareholder of" , "SHAREHOLDER_OF"),
("Register of Shareholder of" , "SHAREHOLDER_OF"),
("shareholder of" , "SHAREHOLDER_OF"),
("Shareholder of" , "SHAREHOLDER_OF"),
("Shareholder (through Julex Foundation) of" , "SHAREHOLDER_OF"),
("Sole shareholder of" , "SHAREHOLDER_OF"),
("Appointor of" , "APPOINTOR_OF"),
("Auditor of" , "AUDITOR_OF"),
("Connected of" , "CONNECTED_OF"),
("Correspondent Addr. of" , "CORRESPONDANCE_ADDRESS_OF"),
("Custodian of" , "CUSTODIAN_OF"),
("Grantee of a mortgage of" , "GRANTEE_OF_MORTGAGE_OF"),
("intermediary of" ,"INTERMEDIARY_OF"),
("Joint Settlor of" ,"LEGAL_COUNSEL_OF"),
("Legal Advisor of" , "LEGAL_COUNSEL_OF"),
("Records & Registers of" ,"LEGAL_COUNSEL_OF"),
("Safekeeping of" ,"LEGAL_COUNSEL_OF"),
("Successor Protector of" , "LEGAL_COUNSEL_OF" ),
("Trustee of Trust of", "TRUSTEE_OF_TRUST"),
("Trust Settlor of" , "LEGAL_COUNSEL_OF"),
("Unit Trust Register of" , "OFFICER_OR_STAFF_OF"),
("Vice President of" , "OFFICER_OR_STAFF_OF"),
("Member of Foundation Council of" , "OFFICER_OR_STAFF_OF"),
("Member / Shareholder of" , "OFFICER_OR_STAFF_OF"),
("Nominated Person of" , "OFFICER_OR_STAFF_OF"),
("Nominee Investment Advisor of" , "OFFICER_OR_STAFF_OF"),
("Nominee Name of" , "OFFICER_OR_STAFF_OF"),
("Nominee Protector of" , "OFFICER_OR_STAFF_OF"),
("Nominee Secretary of" , "OFFICER_OR_STAFF_OF"),
("Nominee Trust Settlor of" , "OFFICER_OR_STAFF_OF"),
("Officer of" , "OFFICER_OR_STAFF_OF"),
("Partner of" , "OFFICER_OR_STAFF_OF"),
("President and director of" , "OFFICER_OR_STAFF_OF"),
("President - Director of" , "OFFICER_OR_STAFF_OF"),
("President of" , "OFFICER_OR_STAFF_OF"),
("Protector of" , "OFFICER_OR_STAFF_OF"),
("Tax Advisor of"  ,  "OFFICER_OR_STAFF_OF"),
("Treasurer of" , "OFFICER_OR_STAFF_OF"),
("Secretary of" , "OFFICER_OR_STAFF_OF"),
("Stockbroker of" , "OFFICER_OR_STAFF_OF"),
("registered address" , "REGISTERED_ADDRESS"),
("related entity" , "RELATED_TO"),
("rel_type" , "RELATION_OF"),
("Investment Advisor of" , "OFFICER_OR_STAFF_OF"),
("General Accountant of" , "OFFICER_OR_STAFF_OF"),
("Co-Trustee of Trust of", "OFFICER_OR_STAFF_OF"),
("Board Representative of" , "OFFICER_OR_STAFF_OF"),
("Chairman of" , "OFFICER_OR_STAFF_OF"),
("Assistant Secretary of" , "OFFICER_OR_STAFF_OF")]

for i, (a, b) in enumerate(tuple_list):
    insert_statement = """USING PERIODIC COMMIT 10000
    LOAD CSV WITH HEADERS FROM "file:///home/dstorey/Notebooks/Panama-Papers/offshore_leaks_csvs-20160621/all_edges.csv" AS row
    WITH row WHERE row.rel_type = "%s"
    MATCH (n1:Node) WHERE n1.node_id = row.node_1
    MATCH (n2:Node) WHERE n2.node_id = row.node_2
    CREATE (n1)-[:%s]->(n2);
    """ % (a , b)
    cypher.execute(insert_statement)

cypher.execute("CREATE INDEX ON :Officer(name);")
cypher.execute("CREATE INDEX ON :Entity(name);")
cypher.execute("CREATE INDEX ON :Entity(address);")
cypher.execute("CREATE INDEX ON :Intermediary(name);")
cypher.execute("CREATE INDEX ON :Address(address);")
#Drop our constraints

cypher.execute("DROP CONSTRAINT ON (n:Node) ASSERT n.node_id IS UNIQUE;")
cypher.execute("MATCH (n) REMOVE n:Node;")

All of our data is now into the database.

blog comments powered by Disqus