# Bulk Data Overview

Sayari offers bulk exports of the graph data used to support its API and Web Interface products. This page provides an overview of these exports including how they are formatted and how to interpret the data.

# Bulk Data Format

Bulk data exports contain information about entities (AKA nodes, vertices) and relationships (AKA links, edges) in the Sayari graph. The data are distributed in two sets of files, one set for entities and one set for relationships between those entities.

When the data are in CSV format, both sets of files are gzipped, with the following characteristics:

  • Each file begins with a header containing column names
  • The delimiter is a comma (,)
  • The quote character (used to escape data that contains the delimiter) is a double quote (")
  • Complex data (lists, maps, etc.) are serialized to JSON before being written

When the data are in parquet format, compression is snappy, and complex data are represented as native parquet structures (not JSON).
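The CSV conventions listed above can be handled with the Python standard library alone. The following is a minimal sketch, not an official loader: the helper name `read_rows` is hypothetical, the in-memory file stands in for a real export file, and treating empty cells as nulls is an assumption that the description above does not confirm.

```python
import csv
import gzip
import io
import json

# Build a tiny gzipped CSV in memory to stand in for one entities file.
# The single data row reuses a name from the sample data on this page.
raw = io.BytesIO()
with gzip.open(raw, "wt", encoding="utf-8", newline="") as fh:
    writer = csv.writer(fh, delimiter=",", quotechar='"')
    writer.writerow(["entity_id", "i", "name"])
    writer.writerow(
        ["0-AWvH8du6YBVdRcxTSPUQ", "3", json.dumps({"value": "HESCO ENG & CON. CO"})]
    )


def read_rows(fileobj, json_columns):
    """Yield one dict per CSV row, decoding the JSON-serialized columns."""
    with gzip.open(fileobj, "rt", encoding="utf-8", newline="") as fh:
        for row in csv.DictReader(fh, delimiter=",", quotechar='"'):
            for col in json_columns:
                # Assumption: an empty cell means the value is null.
                row[col] = json.loads(row[col]) if row[col] else None
            yield row


raw.seek(0)
rows = list(read_rows(raw, json_columns=["name"]))
print(rows[0]["name"]["value"])
```

Note that every non-JSON column arrives as a string when read this way, so a column such as `i` needs an explicit `int(...)` conversion before numeric comparison.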

# Entities Data

The entity files contain information about entities in the graph. Each entity is uniquely identified by the entity_id column. However, the same entity can appear in multiple rows, which allows it to carry multiple attributes of the same type (e.g. aliases). An integer column i indexes the rows of each entity. For example:

entity_id i name
0-AWvH8du6YBVdRcxTSPUQ 0 {"value":"HESCO ENGINEERING & CONSTRUCTION CO (UK)"}
0-AWvH8du6YBVdRcxTSPUQ 1 {"value":"HESCO ENGINEERING & CONSTRUCTION CO"}
0-AWvH8du6YBVdRcxTSPUQ 2 {"value":"HESCO ENGINEERING AND CONSTRUCTION COMPANY LIMITED"}
0-AWvH8du6YBVdRcxTSPUQ 3 {"value":"HESCO ENG & CON. CO"}
0-FvUP2Fo3TqTfglWwsPsw 0 {"value":"ЛЕОНИД СОФРОНОВИЧ КОРНИЛОВ"}
0-FvUP2Fo3TqTfglWwsPsw 1 {"value":"LEONID SAFRONOVICH KORNILOV"}

This sample shows two entities with information split up over six rows. The first four rows correspond to information about entity 0-AWvH8du6YBVdRcxTSPUQ and the last two rows correspond to information about entity 0-FvUP2Fo3TqTfglWwsPsw.

If a single row per entity is desired, the filter WHERE i = 0 can be applied. This row will include the most commonly cited attribute value of each type (e.g. name, address).
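The two access patterns described above (one row per entity, or all values of an attribute per entity) can be sketched in plain Python. The rows are copied from the sample, with the name column already decoded from JSON; the function names `primary_rows` and `names_by_entity` are hypothetical helpers, not part of any Sayari tooling.

```python
from collections import defaultdict

# Rows from the sample above, with `i` as an integer and `name` decoded.
rows = [
    {"entity_id": "0-AWvH8du6YBVdRcxTSPUQ", "i": 0,
     "name": {"value": "HESCO ENGINEERING & CONSTRUCTION CO (UK)"}},
    {"entity_id": "0-AWvH8du6YBVdRcxTSPUQ", "i": 1,
     "name": {"value": "HESCO ENGINEERING & CONSTRUCTION CO"}},
    {"entity_id": "0-AWvH8du6YBVdRcxTSPUQ", "i": 2,
     "name": {"value": "HESCO ENGINEERING AND CONSTRUCTION COMPANY LIMITED"}},
    {"entity_id": "0-AWvH8du6YBVdRcxTSPUQ", "i": 3,
     "name": {"value": "HESCO ENG & CON. CO"}},
    {"entity_id": "0-FvUP2Fo3TqTfglWwsPsw", "i": 0,
     "name": {"value": "ЛЕОНИД СОФРОНОВИЧ КОРНИЛОВ"}},
    {"entity_id": "0-FvUP2Fo3TqTfglWwsPsw", "i": 1,
     "name": {"value": "LEONID SAFRONOVICH KORNILOV"}},
]


def primary_rows(rows):
    """Keep only the i == 0 row of each entity (equivalent to WHERE i = 0)."""
    return [row for row in rows if row["i"] == 0]


def names_by_entity(rows):
    """Collect every name value per entity, ordered by the row index i."""
    grouped = defaultdict(list)
    for row in sorted(rows, key=lambda r: (r["entity_id"], r["i"])):
        if row["name"] is not None:
            grouped[row["entity_id"]].append(row["name"]["value"])
    return dict(grouped)


one_per_entity = primary_rows(rows)
aliases = names_by_entity(rows)
```

Here `one_per_entity` holds two rows, one for each entity in the sample, while `aliases` maps each entity ID to its full list of names.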

The entities data contain both summary fields and attribute fields. Summary fields include:

  • type: type of entity
  • label: best name for entity
  • label_en: best ASCII name for entity, if one exists
  • num_documents: number of underlying documents entity was extracted from
  • sanctioned: whether the entity is sanctioned
  • pep: whether the entity is a politically exposed person
  • degree: number of distinct neighboring entities
  • source_counts: number of documents the entity was cited in, per data source
  • edge_counts: number of neighbors, per edge type

Summary fields are repeated for all rows with the same entity ID. So if an entity has 10 rows (i = 0..9), these fields will be constant for every row.

The entity attribute fields are:

  • Shares: Shares associated with an entity (e.g. its number of issued shares, or the number of shares held by a shareholder)
  • Gender: A person's gender
  • Additional Information: A generic attribute used to hold miscellaneous information not covered by any other attribute. Includes 'value' (for the attribute itself), 'type' (a name, e.g. 'Real property description') and 'extra' (a miscellaneous field to hold any other details) fields.
  • Company Type: A type of legal entity in a given jurisdiction (e.g. 'LLC,' 'Sociedad Anonima,' 'Private Company Limited by Shares')
  • Contact: Contact information for an entity
  • Country: An affiliation of an entity with a given country through residence, nationality, etc.
  • Name: An entity's name. The value may be straightforward (e.g. 'Acme LLC,' 'John Doe') or context-specific (e.g. 'Jones v. Smith' as a legal matter name).
  • Position: An attribute used for many different relationship types that allows for the inclusion of a title or designation (e.g. member_of_the_board_of, Position: 'Secretary of the Board,' or shareholder_of, Position: 'Minority shareholder')
  • Identifier: An ID number that uniquely identifies one entity when value and type are taken into account.
  • Finances: A financial figure, typically share capital
  • Translated Name: A name that has been translated to English
  • Address: A physical location description. Addresses may exist as a simple string ('123 South Main St., South Bend, IN 46556'), or may be in smaller chunks with separate fields ('Number: 123,' 'Street name: South Main...'). Where possible, these fields will be parsed using the Libpostal ontology (https://github.com/openvenues/libpostal#parser-labels), which facilitates more robust address analysis and comparison.
  • Status: The status of a non-person entity.
  • Weak Identifier: A non-unique ID number, like a partially redacted tax ID or a registry identifier whose value and type may be shared by multiple entities
  • Date Of Birth: Birth date of a person
  • Business Purpose: Text and/or a code (NAICS, NACE, ISIC, etc.) that describes what a company is legally allowed to do or produce

See the full entities schema below for more details on these attribute fields. Unlike the summary fields, the values in these columns vary from row to row for a single entity. This is illustrated in the sample above, where the name column's value changes along with the index i.

# Relationships Data

The relationships files contain information about relationships between the entities in the entities data. Three key fields here are src, dst, and type:

src dst type
X1hfIDVLf09lSB_FoK7osw nUVBufHeWAaa7UilnXTj3g SHAREHOLDER_OF
_-UUAFZzEKD0-BTQyzbJyw 4k6j6CKzVIgWdYUV-6OhsA SHAREHOLDER_OF
FWW-PkGN0mXMuxLznOL9kQ QzFDUYCAkV1tmhO4x7tqfg LINKED_TO
QhMECOtIAN9FZQ2N8vX9Hw 7Op53EC0EOl1M38FWcd75A SHAREHOLDER_OF
kBmQaD7KoDMCad2IpjbaYw _86dvjUnV8whpBJwHdm4rQ LEGAL_REPRESENTATIVE_OF

The first row here indicates that the entity with ID X1hfIDVLf09lSB_FoK7osw is a shareholder of the entity with ID nUVBufHeWAaa7UilnXTj3g. The rows in this table are unique according to these three fields, so there is only a single row with src = X1hfIDVLf09lSB_FoK7osw, dst = nUVBufHeWAaa7UilnXTj3g, and type = SHAREHOLDER_OF.
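Because each (src, dst, type) triple appears only once, the triple can serve as a key when loading relationships. The sketch below uses the five sample rows to check that property and to look up the outgoing relationships of one entity; `duplicate_keys` and `outgoing` are hypothetical helper names used only for illustration.

```python
from collections import Counter

# (src, dst, type) triples copied from the sample rows above.
edges = [
    ("X1hfIDVLf09lSB_FoK7osw", "nUVBufHeWAaa7UilnXTj3g", "SHAREHOLDER_OF"),
    ("_-UUAFZzEKD0-BTQyzbJyw", "4k6j6CKzVIgWdYUV-6OhsA", "SHAREHOLDER_OF"),
    ("FWW-PkGN0mXMuxLznOL9kQ", "QzFDUYCAkV1tmhO4x7tqfg", "LINKED_TO"),
    ("QhMECOtIAN9FZQ2N8vX9Hw", "7Op53EC0EOl1M38FWcd75A", "SHAREHOLDER_OF"),
    ("kBmQaD7KoDMCad2IpjbaYw", "_86dvjUnV8whpBJwHdm4rQ", "LEGAL_REPRESENTATIVE_OF"),
]


def duplicate_keys(edges):
    """Return every (src, dst, type) triple that occurs more than once."""
    counts = Counter(edges)
    return [key for key, n in counts.items() if n > 1]


def outgoing(edges, entity_id):
    """List (dst, type) pairs for relationships whose src is entity_id."""
    return [(dst, rel_type) for src, dst, rel_type in edges if src == entity_id]


dupes = duplicate_keys(edges)
first_entity_edges = outgoing(edges, "X1hfIDVLf09lSB_FoK7osw")
```

An empty `dupes` list is the expected result for data that respects the uniqueness property; a non-empty list would suggest that files were loaded twice or concatenated incorrectly.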

The relationships files contain several date fields:

  • date
  • from_date
  • to_date

These fields are often null, but when populated give information about the time period that the relationship is valid for. Relationships also have the following attribute fields:

  • position
  • additional_information
  • shares

See the full relationships schema below for more details on these attribute fields.

The final relationship field is match_keys. This field is only populated when type = POSSIBLY_SAME_AS to indicate that two entities are possibly the same entity. An example of data in the match_keys field is as follows:

[
    {
        "key": "house_number",
        "value": "5",
        "entity1": "5",
        "entity2": "5"
    },
    {
        "key": "road",
        "value": "ROOSEVELT STR YALTA CRIMEA",
        "entity1": "Roosevelt Str. Yalta Crimea",
        "entity2": "Roosevelt Str. Yalta Crimea"
    },
    {
        "key": "postcode",
        "value": "98600",
        "entity1": "98600",
        "entity2": "98600"
    },
    {
        "key": "name",
        "value": "YALTA MERCHANT SEA PORT",
        "entity1": "YALTA MERCHANT SEA PORT",
        "entity2": "Yalta Merchant Sea Port"
    }
]

Each item in the array indicates a field that matched between the two entities. key gives the field name, value gives the normalized value, entity1 gives the value for the src entity, and entity2 gives the value for the dst entity. The above sample illustrates that the two entities are possibly the same due to a shared name and partial address match.
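As a rough illustration of how the match_keys structure can be consumed, the sketch below parses the sample array as it would appear in a CSV export (a JSON string) and reports which fields matched and which of them differ in their raw, pre-normalization form. The function name `summarize_match_keys` is hypothetical.

```python
import json

# The sample match_keys value from above, as a JSON string.
match_keys_json = """
[
    {"key": "house_number", "value": "5", "entity1": "5", "entity2": "5"},
    {"key": "road", "value": "ROOSEVELT STR YALTA CRIMEA",
     "entity1": "Roosevelt Str. Yalta Crimea",
     "entity2": "Roosevelt Str. Yalta Crimea"},
    {"key": "postcode", "value": "98600", "entity1": "98600", "entity2": "98600"},
    {"key": "name", "value": "YALTA MERCHANT SEA PORT",
     "entity1": "YALTA MERCHANT SEA PORT",
     "entity2": "Yalta Merchant Sea Port"}
]
"""


def summarize_match_keys(match_keys):
    """Return (matched field names, fields whose raw values differ)."""
    matched = [item["key"] for item in match_keys]
    raw_differs = [
        item["key"] for item in match_keys if item["entity1"] != item["entity2"]
    ]
    return matched, raw_differs


matched, raw_differs = summarize_match_keys(json.loads(match_keys_json))
print(matched)
print(raw_differs)
```

For this sample, the name field is the only one whose raw values differ between the two entities (they differ in letter case), which shows why the normalized value is reported separately from entity1 and entity2.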

# Entities Schema

Below is the full schema for entities files when read in parquet format. The CSV files have the same fields, but with complex fields (lists, maps, etc.) serialized as JSON strings.

root
 |-- entity_id: string (nullable = true)
 |-- i: integer (nullable = true)
 |-- type: string (nullable = true)
 |-- label: string (nullable = true)
 |-- label_en: string (nullable = true)
 |-- num_documents: long (nullable = true)
 |-- sanctioned: boolean (nullable = true)
 |-- pep: boolean (nullable = true)
 |-- degree: long (nullable = true)
 |-- source_counts: map (nullable = true)
 |    |-- key: string
 |    |-- value: double (valueContainsNull = true)
 |-- edge_counts: map (nullable = true)
 |    |-- key: string
 |    |-- value: struct (valueContainsNull = true)
 |    |    |-- out: long (nullable = true)
 |    |    |-- in: long (nullable = true)
 |    |    |-- total: long (nullable = true)
 |-- gender: struct (nullable = true)
 |    |-- extra: map (nullable = true)
 |    |    |-- key: string
 |    |    |-- value: string (valueContainsNull = true)
 |    |-- date: string (nullable = true)
 |    |-- from_date: string (nullable = true)
 |    |-- to_date: string (nullable = true)
 |    |-- value: string (nullable = true)
 |-- business_purpose: struct (nullable = true)
 |    |-- extra: map (nullable = true)
 |    |    |-- key: string
 |    |    |-- value: string (valueContainsNull = true)
 |    |-- date: string (nullable = true)
 |    |-- from_date: string (nullable = true)
 |    |-- to_date: string (nullable = true)
 |    |-- value: string (nullable = true)
 |    |-- code: string (nullable = true)
 |-- person_status: struct (nullable = true)
 |    |-- extra: map (nullable = true)
 |    |    |-- key: string
 |    |    |-- value: string (valueContainsNull = true)
 |    |-- date: string (nullable = true)
 |    |-- from_date: string (nullable = true)
 |    |-- to_date: string (nullable = true)
 |    |-- value: string (nullable = true)
 |-- finances: struct (nullable = true)
 |    |-- extra: map (nullable = true)
 |    |    |-- key: string
 |    |    |-- value: string (valueContainsNull = true)
 |    |-- date: string (nullable = true)
 |    |-- from_date: string (nullable = true)
 |    |-- to_date: string (nullable = true)
 |    |-- value: double (nullable = true)
 |    |-- context: string (nullable = true)
 |    |-- type: string (nullable = true)
 |    |-- currency: string (nullable = true)
 |-- name: struct (nullable = true)
 |    |-- extra: map (nullable = true)
 |    |    |-- key: string
 |    |    |-- value: string (valueContainsNull = true)
 |    |-- date: string (nullable = true)
 |    |-- from_date: string (nullable = true)
 |    |-- to_date: string (nullable = true)
 |    |-- value: string (nullable = true)
 |    |-- language: string (nullable = true)
 |    |-- context: string (nullable = true)
 |-- identifier: struct (nullable = true)
 |    |-- extra: map (nullable = true)
 |    |    |-- key: string
 |    |    |-- value: string (valueContainsNull = true)
 |    |-- date: string (nullable = true)
 |    |-- from_date: string (nullable = true)
 |    |-- to_date: string (nullable = true)
 |    |-- value: string (nullable = true)
 |    |-- type: string (nullable = true)
 |-- additional_information: struct (nullable = true)
 |    |-- extra: map (nullable = true)
 |    |    |-- key: string
 |    |    |-- value: string (valueContainsNull = true)
 |    |-- date: string (nullable = true)
 |    |-- from_date: string (nullable = true)
 |    |-- to_date: string (nullable = true)
 |    |-- value: string (nullable = true)
 |    |-- type: string (nullable = true)
 |-- address: struct (nullable = true)
 |    |-- extra: map (nullable = true)
 |    |    |-- key: string
 |    |    |-- value: string (valueContainsNull = true)
 |    |-- date: string (nullable = true)
 |    |-- from_date: string (nullable = true)
 |    |-- to_date: string (nullable = true)
 |    |-- value: string (nullable = true)
 |    |-- language: string (nullable = true)
 |    |-- house: string (nullable = true)
 |    |-- house_number: string (nullable = true)
 |    |-- po_box: string (nullable = true)
 |    |-- building: string (nullable = true)
 |    |-- entrance: string (nullable = true)
 |    |-- staircase: string (nullable = true)
 |    |-- level: string (nullable = true)
 |    |-- unit: string (nullable = true)
 |    |-- road: string (nullable = true)
 |    |-- metro_station: string (nullable = true)
 |    |-- suburb: string (nullable = true)
 |    |-- city_district: string (nullable = true)
 |    |-- city: string (nullable = true)
 |    |-- state_district: string (nullable = true)
 |    |-- island: string (nullable = true)
 |    |-- state: string (nullable = true)
 |    |-- postcode: string (nullable = true)
 |    |-- country_region: string (nullable = true)
 |    |-- country: string (nullable = true)
 |    |-- world_region: string (nullable = true)
 |    |-- category: string (nullable = true)
 |    |-- near: string (nullable = true)
 |-- shares: struct (nullable = true)
 |    |-- extra: map (nullable = true)
 |    |    |-- key: string
 |    |    |-- value: string (valueContainsNull = true)
 |    |-- date: string (nullable = true)
 |    |-- from_date: string (nullable = true)
 |    |-- to_date: string (nullable = true)
 |    |-- num_shares: double (nullable = true)
 |    |-- monetary_value: double (nullable = true)
 |    |-- currency: string (nullable = true)
 |    |-- percentage: double (nullable = true)
 |    |-- type: string (nullable = true)
 |-- company_type: struct (nullable = true)
 |    |-- extra: map (nullable = true)
 |    |    |-- key: string
 |    |    |-- value: string (valueContainsNull = true)
 |    |-- date: string (nullable = true)
 |    |-- from_date: string (nullable = true)
 |    |-- to_date: string (nullable = true)
 |    |-- value: string (nullable = true)
 |-- weak_identifier: struct (nullable = true)
 |    |-- extra: map (nullable = true)
 |    |    |-- key: string
 |    |    |-- value: string (valueContainsNull = true)
 |    |-- date: string (nullable = true)
 |    |-- from_date: string (nullable = true)
 |    |-- to_date: string (nullable = true)
 |    |-- value: string (nullable = true)
 |    |-- type: string (nullable = true)
 |-- date_of_birth: struct (nullable = true)
 |    |-- extra: map (nullable = true)
 |    |    |-- key: string
 |    |    |-- value: string (valueContainsNull = true)
 |    |-- date: string (nullable = true)
 |    |-- from_date: string (nullable = true)
 |    |-- to_date: string (nullable = true)
 |    |-- value: string (nullable = true)
 |-- translated_name: struct (nullable = true)
 |    |-- extra: map (nullable = true)
 |    |    |-- key: string
 |    |    |-- value: string (valueContainsNull = true)
 |    |-- date: string (nullable = true)
 |    |-- from_date: string (nullable = true)
 |    |-- to_date: string (nullable = true)
 |    |-- value: string (nullable = true)
 |    |-- original: string (nullable = true)
 |    |-- context: string (nullable = true)
 |-- status: struct (nullable = true)
 |    |-- extra: map (nullable = true)
 |    |    |-- key: string
 |    |    |-- value: string (valueContainsNull = true)
 |    |-- date: string (nullable = true)
 |    |-- from_date: string (nullable = true)
 |    |-- to_date: string (nullable = true)
 |    |-- value: string (nullable = true)
 |    |-- text: string (nullable = true)
 |-- country: struct (nullable = true)
 |    |-- extra: map (nullable = true)
 |    |    |-- key: string
 |    |    |-- value: string (valueContainsNull = true)
 |    |-- date: string (nullable = true)
 |    |-- from_date: string (nullable = true)
 |    |-- to_date: string (nullable = true)
 |    |-- value: string (nullable = true)
 |    |-- context: string (nullable = true)
 |    |-- state: string (nullable = true)
 |-- contact: struct (nullable = true)
 |    |-- extra: map (nullable = true)
 |    |    |-- key: string
 |    |    |-- value: string (valueContainsNull = true)
 |    |-- date: string (nullable = true)
 |    |-- from_date: string (nullable = true)
 |    |-- to_date: string (nullable = true)
 |    |-- value: string (nullable = true)
 |    |-- type: string (nullable = true)
 |-- position: struct (nullable = true)
 |    |-- extra: map (nullable = true)
 |    |    |-- key: string
 |    |    |-- value: string (valueContainsNull = true)
 |    |-- date: string (nullable = true)
 |    |-- from_date: string (nullable = true)
 |    |-- to_date: string (nullable = true)
 |    |-- value: string (nullable = true)

# Relationships Schema

Below is the full schema for relationships files when read in parquet format. The CSV files have the same fields, but with complex fields (lists, maps, etc.) serialized as JSON strings.

root
 |-- src: string (nullable = true)
 |-- dst: string (nullable = true)
 |-- type: string (nullable = true)
 |-- date: string (nullable = true)
 |-- from_date: string (nullable = true)
 |-- to_date: string (nullable = true)
 |-- position: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- extra: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- date: string (nullable = true)
 |    |    |-- from_date: string (nullable = true)
 |    |    |-- to_date: string (nullable = true)
 |    |    |-- value: string (nullable = true)
 |-- additional_information: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- extra: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- date: string (nullable = true)
 |    |    |-- from_date: string (nullable = true)
 |    |    |-- to_date: string (nullable = true)
 |    |    |-- value: string (nullable = true)
 |    |    |-- type: string (nullable = true)
 |-- shares: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- extra: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- date: string (nullable = true)
 |    |    |-- from_date: string (nullable = true)
 |    |    |-- to_date: string (nullable = true)
 |    |    |-- num_shares: double (nullable = true)
 |    |    |-- monetary_value: double (nullable = true)
 |    |    |-- currency: string (nullable = true)
 |    |    |-- percentage: double (nullable = true)
 |    |    |-- type: string (nullable = true)
 |-- match_keys: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- key: string (nullable = true)
 |    |    |-- value: string (nullable = true)
 |    |    |-- entity1: string (nullable = true)
 |    |    |-- entity2: string (nullable = true)