December 2017

Spatial Data Package investigation

An investigation into whether there is a need for a "Spatial Data Package" specification within the Frictionless Data ecosystem

Steve Bennett

Summary

We investigate whether there is a need for a "Spatial Data Package" specification within the Frictionless Data ecosystem. Specifically:

  • The general state of open spatial data and popular file formats.
  • Problems and pain points in sharing open spatial data.
  • If, and how, new Data Package specifications could help solve these problems.

We conclude:

  • There is significant friction in the uptake of spatial data by non-GIS-expert users. This problem can be addressed by adding a small amount of metadata, and some data processing conventions, to existing Data Package standards:

    • Tabular Data Package, with "location" resource attribute, for datasets consisting of points (or rather, rows with point locations).

    • Data Package containing GeoJSON files, with "location" resource attribute, for vector datasets of line and polygon data.

  • There is no good solution for a common kind of semi-spatial data, that of tabular data that links to standard boundaries such as administrative or statistical regions. We propose:

    • Tabular Data Package, with "location" resource attribute linking boundary identifiers to yet-to-be-defined standard codelists.
  • For cases where the above solutions do not meet the package creator's need, including raster data or data that is not under the package creator's direct control, we propose improvements to the emerging Spatial Data Package format already under way.

Introduction

Spatial (or "geospatial") data – data relating to locations on the Earth’s surface – makes up a significant proportion of all the data regularly shared on open data platforms. It is frequently of immediate, direct value and datasets such as property or government administrative boundaries are sought out even by users with little background in geospatial data.

Overview of spatial data

The most commonly shared forms of spatial data are:

  • Vector features: points, lines, polygons and composite types such as multi-polygons (for instance, an archipelago). Usually expressed as "geometry" (the shape of the feature) and "attributes" (data including names, descriptions, or numeric fields associated with the described feature).

  • Raster: an image geo-referenced with spatial coordinates, such as aerial imagery, grids of data calculated across an area, or pre-generated map images.

Other types of geospatial data less commonly shared on government open data platforms include:

  • Tiles: vector or raster data split along pre-defined boundaries to facilitate transport and displaying maps.

  • Multi-dimensional array data and coverages: complex data such as climate models which relate to the earth’s surface in three dimensions, plus time, and may depend on other parameters as well. Usually represented with NetCDF.

  • Point clouds: large numbers of points in 3D space, with colour, captured by specialised cameras, to reconstruct a virtual environment.

  • 3D models: fully three-dimensional representations of objects such as buildings or trees located in geospatial space.

  • Video

  • Geo-temporal features: various combinations of space and time dimensions, such as:

    • A point with a data dimension that varies over time, such as the height of a river.

    • A point that moves over time, such as the location of a ship.

    • A moving point with a time-series data dimension, such as a sensor attached to a moving ship.

    • A point situated at a moment in time, such as earthquakes.

    • A point (or other feature) related to a period of time, such as road closures.

  • Curves

There is also data which relates implicitly to geometry not specified directly:

  • Points identified by place name (eg, "Melbourne, Australia"). This is a type of geocoding.

  • Administrative or statistical boundaries identified by identifier (eg, "France").

  • Other data identified in a way relevant to the particular context, such as roads identified by name or number, segments of a city identified by a study area identifier, and so on.

Use cases

Several of the most common open data use cases involving spatial data are:

  • Importing authoritative boundaries (such as property boundaries, or administrative regions such as councils or states) for a wide variety of tasks in urban planning, property development, city management, etc.

  • Assembling maps of layers from different sources (such as roads, trees and rivers) for human interpretation. Modern tools make this kind of task very accessible to people without specialist geospatial expertise.

  • Performing spatial analysis using data from different sources to find patterns between them (such as in population health, transport planning etc).

  • Combining data that relates to boundaries (such as average age per statistical area) with the geometry of those boundaries to produce a choropleth map.

For the purpose of this investigation, we focus on general purpose spatial data, particularly in the context of sharing data between different communities, and especially communities that do not have significant spatial expertise.

Existing spatial file formats for vector data

In addition to the following spatial file formats, there are many spatial data service standards, which we do not consider relevant. Those include WMS, WFS and Esri’s services, and allow accessing portions of data on demand, rather than bulk download.

The discussion below is far from comprehensive. See for instance the analysis by Pons and Masó (NOTE: http://www.sciencedirect.com/science/article/pii/S0098300416303533?via%3Dihub) of 7 formats for packaging spatial data in the sciences, with more stringent requirements such as "metadamodel identity" and "disclosure".

Shapefile

Esri’s Shapefile format is still dominant within government and as a general purpose interchange format, particularly between groups that do not often exchange geospatial data. It has some serious technical shortcomings:

  • Feature property names have a maximum length of 10 characters, lack case, and don’t support Unicode. (They are stored in a .dbf file, in dBase IV format, which dates from 1988 (NOTE: https://www.loc.gov/preservation/digital/formats/fdd/fdd000325.shtml).)

  • A "Shapefile" is actually at least 3 files, and possibly more. This is inconvenient for use cases such as uploading into the browser, which usually requires zipping the files together first.

  • Shapefiles can use essentially any coordinate reference system. The associated .prj file which defines this system is optional, meaning the consumer may have no direct way to determine the projection of a given file (NOTE: https://gis.stackexchange.com/questions/7839/identifying-coordinate-system-of-shapefile-when-unknown).

  • The geometry and attribute files are binary, which makes manipulating them by hand or using web technologies difficult.

  • Other issues include a maximum file size of 2GB, inability to mix geometry types, poor projection interoperability, and a maximum of 255 attributes. (NOTE: http://switchfromshapefile.org/)

GML

The Geography Markup Language (GML) (NOTE: http://www.opengeospatial.org/standards/gml) is an XML standard from the Open Geospatial Consortium. Like many OGC standards, it is extremely comprehensive and detailed, which makes it also cumbersome to work with. It (and its variants) are used in many specialised domains but less so as a general purpose data interchange format.

GeoJSON

GeoJSON, released in 2008 then updated in 2016, is a single-file format based on JSON, geared towards web applications. It supports points, lines, polygons, multipoints, multilines and multipolygons, and properties on those objects (using JSON’s type system of numbers, strings, booleans and nulls), with well-defined orderings for topologies such as rings.

Benefits include:

  • Can be read and written by hand.

  • Geodata can be embedded directly within software source code or within other JSON data files (eg, Mapbox style files (NOTE: https://www.mapbox.com/mapbox-gl-js/style-spec/#sources-geojson)).

  • Very easy to load, manipulate and save in a wide variety of languages, without requiring specialist geospatial libraries.

  • Extensible, additional data and metadata can be easily added.

  • The specification itself is short and easily understandable.
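
As a rough illustration of the point about not needing specialist geospatial libraries, the sketch below uses only Python's standard json module to parse a small GeoJSON document, keep its point features, and write the result back out. The features and property names are invented for this example.

```python
import json

# A tiny hand-written GeoJSON document (invented data, for illustration only).
raw = """
{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature",
     "geometry": {"type": "Point", "coordinates": [144.96, -37.81]},
     "properties": {"name": "Melbourne", "kind": "city"}},
    {"type": "Feature",
     "geometry": {"type": "Point", "coordinates": [144.36, -38.15]},
     "properties": {"name": "Geelong", "kind": "city"}},
    {"type": "Feature",
     "geometry": {"type": "LineString",
                  "coordinates": [[144.9, -37.8], [145.0, -37.8]]},
     "properties": {"name": "Example Road", "kind": "road"}}
  ]
}
"""

collection = json.loads(raw)

# Keep only point features. GeoJSON positions are [longitude, latitude].
points = [f for f in collection["features"]
          if f["geometry"]["type"] == "Point"]

filtered = {"type": "FeatureCollection", "features": points}
point_names = [f["properties"]["name"] for f in points]

# Serialise the filtered collection back to GeoJSON text.
output = json.dumps(filtered, indent=2)
```

Nothing here is specific to Python: any language with a JSON parser can do the same.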

Some shortcomings include:

  • Somewhat space-inefficient compared to binary formats or TopoJSON. (A protocol-buffer encoding of GeoJSON can rectify this (NOTE: https://github.com/mapbox/geobuf)).

  • Coordinates must be specified in lat/long (EPSG:4326). Projected coordinates are intentionally not supported.

  • No support for raster data including rendered tiles.

  • No standard support for styling. Some basic styling is supported through the simplestyle-spec convention.

  • Non-compliant GeoJSON is easy to produce, and is somewhat common.

  • No mechanism for schemas. There is no standard way to describe what the fields mean, or indicate whether every feature has every field.

GeoJSON is very well supported in modern web mapping applications such as Mapbox, Leaflet, and Turf, but remains somewhat patchy in desktop applications.

GeoPackage (2014)

GeoPackage is a rich geospatial data format defined as a way of packaging data within an SQLite container, geared towards mobile applications. Thus any environment which supports SQLite also supports random access on GeoPackage data files, querying, filtering and so on.

GeoPackage can store features, and also tile "pyramids" - sets of raster or vector data carved into pre-defined units to facilitate easy transport of data to browsers or mobile applications for displaying maps.

Although useful for certain applications it is not commonly seen as an interchange format.

CSV

Comma-separated value files are widely used for distributing non-spatial data. There are various ways of representing spatial data in CSVs, but no convention or standard has achieved dominance. Each feature is generally represented as a row, with one or two geometry columns.

CSV is appealing as an open data format for point data because it is familiar to a large audience of people who do not regularly use spatial data. It thus allows data to flow across domains of practice, which is particularly valuable.

Point data can be represented in CSV several different ways including:

Other forms of vector data can be expressed in a geometry column using:

Related: CSV Dialect.

Because there is not any conventional way to attach metadata to a CSV file describing which of these methods is being used, some tools have defined their own. For instance, GDAL uses an optional file called VRT to define exactly which fields contain geometry (NOTE: http://www.gdal.org/drv_csv.html).

Attempts to define how geospatial data is captured within CSV include:

So, overall, disadvantages of CSV include:

  • Poor standardisation of the CSV format itself.

  • Poor standardisation of geospatial data within CSV.

  • Lack of type information: numbers and strings are not differentiated.

  • No standard place to put metadata.

  • Inability to distinguish a CSV file containing location information from one that doesn’t. (For this reason, the former type are given a resource type of "csv-geo-au" on data.gov.au, rather than "csv")
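
Because a CSV file carries no metadata, a consuming tool typically has to guess whether it contains locations at all. The sketch below shows one such heuristic, based purely on column names. The lists of candidate names are an assumption made for this example and are not drawn from any standard.

```python
import csv
import io

# Candidate header names (an assumption for this sketch, not a standard).
LAT_NAMES = {"lat", "latitude", "y"}
LON_NAMES = {"lon", "lng", "long", "longitude", "x"}


def guess_latlon_columns(header):
    """Return (lat_column, lon_column) guessed from header names, or None."""
    lat = next((h for h in header if h.strip().lower() in LAT_NAMES), None)
    lon = next((h for h in header if h.strip().lower() in LON_NAMES), None)
    if lat is None or lon is None:
        return None
    return lat, lon


def sniff_csv(text):
    """Read only the header row of CSV text and guess its location columns."""
    header = next(csv.reader(io.StringIO(text)))
    return guess_latlon_columns(header)


# Invented sample data.
spatial = "Species,Latitude,Longitude\nMagpie,-37.81,144.96\n"
non_spatial = "Species,Count\nMagpie,12\n"
```

A heuristic like this is fragile (a column named "x" may have nothing to do with location), which is the motivation for explicit metadata.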

Spatial data sharing pain points

Size

Once files get larger than ~10MB, previewing a whole GeoJSON file in the browser becomes impractical. Setting up a spatial data server such as GeoServer is too complex for the reward, so the fall-back is to download the whole dataset and preview locally in QGIS.

Projections

Spatial data is generally developed within a specific projection adopted by the relevant organisation. Although it is essentially trivial to reproject data from one coordinate reference system to another, this requires three things:

  1. Awareness that data can exist in different projections, and that whatever the user is attempting to do may require a different projection.

  2. Knowledge of what projection the source data is in, and getting that information expressed in the right format (often an EPSG code, such as EPSG:4283)

  3. Using a tool such as GDAL’s ogr2ogr to perform the reprojection.

For a GIS professional, this process might take a couple of minutes. It can completely stump a GIS newbie, or leave them floundering for hours. According to Tom MacWright, formerly Mapbox technical lead, and heavily involved in developing several open geospatial standards (NOTE: https://macwright.org/2015/03/23/geojson-second-bite):

Data projections are friction. If you aren’t a surveyor and don’t actually have centimeter-accuracy data, using projections adds friction for users: instead of simply downloading and using data, they need to determine the projection - sometimes manually - and occasionally even need to load in new projection definitions in order to use it. And they usually, off the bat, just convert it to EPSG:4326.

So, the take-home lesson of data projections is that they’re useful for extremely-high precision datasets. But such data is rare, and usually the GeoJSON default of EPSG:4326 is a better choice for sharing and storing data.
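
To give a sense of what a reprojection involves in the simplest case, the sketch below converts between spherical Web Mercator metres (EPSG:3857) and longitude/latitude in degrees using only the standard formulas. It is an illustration under that one assumption: conversions that involve a change of datum or a national grid (for example data referenced to EPSG:4283) need a full library such as GDAL, and the function names here are invented for the example.

```python
import math

# Radius of the sphere used by Web Mercator (EPSG:3857), in metres.
EARTH_RADIUS = 6378137.0


def web_mercator_to_lonlat(x, y):
    """Convert Web Mercator metres to (longitude, latitude) in degrees."""
    lon = math.degrees(x / EARTH_RADIUS)
    lat = math.degrees(2.0 * math.atan(math.exp(y / EARTH_RADIUS)) - math.pi / 2.0)
    return lon, lat


def lonlat_to_web_mercator(lon, lat):
    """Convert (longitude, latitude) in degrees to Web Mercator metres."""
    x = EARTH_RADIUS * math.radians(lon)
    y = EARTH_RADIUS * math.log(math.tan(math.pi / 4.0 + math.radians(lat) / 2.0))
    return x, y


# Round trip for a point near Melbourne.
x, y = lonlat_to_web_mercator(144.96, -37.81)
lon, lat = web_mercator_to_lonlat(x, y)
```

Even in this easy case the consumer must already know which projection the source coordinates are in, which is the friction described above.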

Geocoding place names

Converting a place name to a location requires using a geocoding or gazetteer service. Many national and subnational gazetteers exist, as well as a few international ones:

Issues:

  • There’s no conventional way to define the column(s) containing the place name data.

  • There’s no standard list of place names, nor is such a thing likely to exist, due to the nature of small places, diverse language and spelling, and varying administrative practices.

Metadata

There do not appear to be any widely used, open standards for documenting basic geospatial metadata independent of the data itself. ISO 19115-1:2014 provides a metadata schema, implemented by Esri’s Shapefile (NOTE: http://desktop.arcgis.com/en/arcmap/10.3/manage-data/metadata/support-for-iso-metadata-standards.htm) but is closed. This means there is not a convenient way to provide accessible information about:

  • The full name, description, provenance and other information about each layer.

  • The full name, description and type of each property.

Boundary-linked data

It is very common for data to reference boundaries which are assumed to be known to the consumer. These may be administrative (such as country or state borders), statistical (such as census districts or Australia’s "statistical areas" (NOTE: http://www.abs.gov.au/websitedbs/D3310114.nsf/home/Australian+Statistical+Geography+Standard+(ASGS))), or a diverse range of other possibilities, such as water catchments, ecological zones, tourism districts etc.

Although tools such as CARTO and Tableau support the concept of linking place names to pre-defined geometries, there does not appear to be any standard way of managing this process, or even describing it:

We use the term "boundary-linked data".

A global solution needs:

  • A defined identifier for the type of boundary being referenced. (eg, "AU:State_name")

  • A defined set of identifiers for those boundaries. (eg, "Victoria")

  • A way to specify different versions of those boundaries (eg, "AU:State_name:2008").

Current solutions:

  • GADM (NOTE: http://www.gadm.org/) (Global Administrative Areas): a downloadable database of 294,430 administrative areas.

  • MapIt (NOTE: http://mapit.openlocal.org.au/) is an open-source (NOTE: https://github.com/mysociety/mapit) platform providing a lookup service. Given a lat/long, it returns a list of boundaries that the point is within. It is not clear if the supported boundaries follow any standards. The UK deployment contains boundaries of Greater London Authority (GLA), London borough (LBO), London borough ward (LBW), European region (EUR), London assembly constituency (LAC), "2001 Middle Layer Super Output Area (Full)" (OMF), "2001 Middle Layer Super Output Area (Generalised)" (OMG) and a number more.

Relevant standards:

There does not yet appear to be a global standard for statistical geography, although a Global Statistical Geospatial Framework (NOTE: http://www.efgs.info/information-base/production-model/global/) is being developed through the UN (NOTE: http://ggim.un.org/ggim_20171012/docs/meetings/3rd UN-EG-ISGI/3rd UN EG ISGI - session2 - Global SGF summary.pdf), based on the Australian Statistical Geography Standard (ASGS).

Summary:

  • A standard way to link to countries and first (and sometimes second) level administrative regions is easy, with ISO 3166.

  • There is no existing global standard for statistical boundaries.

Solution?

A solution in this area would have to be extensible and global in scope. If it followed the approach of csv-geo-au in which the column name provides all needed context, it would probably have to allow boundaries of limited scope to be namespaced within larger units. (The namespacing itself could follow ISO 3166). For instance:

  • Country - country borders, global scope (not namespaced)

  • UK:Postcode - postcodes within the UK

  • AU:VIC:Electorate - State electorates within the Australian state of Victoria

Can Frictionless Data help?

When considering new specifications we have to weigh up:

  • Can they address some of the pain points described above?

  • How will they fit into existing Data Package workflows?

  • How will they meet the design principles of Frictionless Data, such as "zen-like simplicity"?

    • There are already around 10 Frictionless Data specifications, and adding to this number risks creating complexity and confusion.

    • Secondly, the range of spatial data may be too broad for a single specification to make sense. Datasets of points with many attributes have more in common with tabular data than with raster data, for instance.

There is an overlap between spatial "point" data and "tabular" data. Consider a dataset of wildlife sightings reported by a city’s residents, each including the species, identification notes, date and time, and location. (NOTE: For instance, https://data.melbourne.vic.gov.au/Environment/BioBlitz-2014/a945-pqqr) Is this a tabular dataset (suitable for analysis with a spreadsheet application), or a spatial dataset (suitable for analysis with GIS or making a map)? In fact, it’s both.

We do not want to create friction by forcing data custodians to choose between making such a dataset a "tabular" or "spatial" Data Package, nor force consumers to separately look in "spatial" vs "non-spatial" catalogues.

Data Package workflows to consider

Tool category | Example
Libraries that load a DP into a programming environment for analysis. | Datapackage-rb, tableschema-py and many others.
Tools that assemble DPs from heterogeneous sources, with hand-crafted scripts. | DataRetriever
Platforms that offer DPs as an output format. | Data.World
Tools that allow a data custodian to package their data as a DP. | DataPackagist, DataCurator
Catalogues and discovery tools that facilitate finding and downloading DPs. | DataHub, DataRetriever
Analysis platforms that accept DPs as input. | OpenSpending

These can be boiled down to:

  1. Authoring tools.

  2. Distribution infrastructure.

  3. Consumer-facing analysis tools.

  4. Direct access by consumers without DP-specific tooling.

We thus consider each of the major kinds of spatial data, and how each should best be handled within the Data Package world:

  • Points

  • Simple vector data (lines and polygons)

  • Tabular data linked to standard boundaries

  • Complex spatial data, including raster data

Comprehensive or minimal?

There is a choice in how prescriptive a specification should be. We will look at two options, called for the sake of argument "minimal SDP" and "comprehensive SDP":

  • Minimal SDP:

    • Point data prepared in a strict CSV format with optional GeoJSON alternative representation, plus metadata.

    • Other vector data prepared as GeoJSON, plus metadata. No support for other spatial formats, projections, or raster data.

    • Does not necessarily involve a new DP profile.

  • Comprehensive SDP: broad range of spatial types, formats and projections supported, so most existing spatial data on the web can be packaged without altering it.

We can consider formalising the Minimal SDP, the Comprehensive SDP, or both.

Only minimal SDPs

  1. Authoring tools will take shapefiles and other vector formats and spit out either TDPs or minimal SDPs, likely requiring GDAL. The heavy lifting happens here.

  2. Distribution infrastructure can easily provide visualisations of TDPs or minimal SDPs and use libraries such as Turf (NOTE: http://turfjs.org).

  3. Consumer-facing analysis tools have an easy job, because they’re dealing with a narrow range of formats.

  4. Direct access by consumers is facilitated, because they don’t need to deal with different formats, projections, etc. Any minimal SDP always behaves the same way.

Only comprehensive SDPs

  1. Authoring tools are simplified because they don’t really need to look inside the spatial data being wrapped. They just ask questions like what projection and format it is, and maybe attempt to compute bounds.

  2. Distribution infrastructure has a harder time. It can visualise some formats, but there’s a lot of complexity to constantly manage.

  3. Consumer-facing analysis tools also have a hard time. They can’t do much without a full spatial library like GDAL. (Compared to, say, TurfJS).

  4. There is not much benefit for consumers accessing data directly over the current state of affairs. They get some benefits from layers being properly described, and the projection is recorded if they look in the datapackage.json.

Both

In this scenario, minimal SDPs are strongly preferred, where appropriate, but comprehensive SDPs can be created if needed.

  1. Authoring tools could support either or both.

  2. Distribution infrastructure could easily support minimal SDPs with some support for comprehensive SDPs if useful. This tradeoff would be expected by those creating comprehensive SDPs.

  3. Consumer-facing analysis tools similarly have a reasonable trade-off.

  4. Consumers directly accessing DPs win when they are minimal. There is potential for confusion if the types are not well explained and named. Consumers of comprehensive DPs in certain domains win by having spatial data in their expected format, with extra metadata.

Recommendation

We lean towards the "both" option, with further discussion with the community to choose a path forward, and to come up with clear names and guidance.

Recommendations

Location metadata

Given that location information can exist in conjunction with other kinds of data, we recommend that two types of location metadata be included in Data Package descriptors where appropriate:

  • Package-level "spatial-profile" attribute indicates that the Data Package contains location information, and what sort it is. This makes filtering for location-containing Data Packages easy. (This attribute may be superfluous, in that it can be inferred from attributes on the resources.)

  • Resource-level "locations" attribute indicates where (if at all) location information can be found within a resource, and how to interpret it. Multiple kinds of location information may exist simultaneously.

We do not recommend including resource-level location information within a "schema" attribute because:

  • The structure of "schema" is undefined for Data Resource.

  • A multi-column lat-lon location could not easily be described in Table Schema.

Package-level "spatial-profile" attribute

One of:


{
  // Exactly one of the following values:
  // "tabular-points", "simple-vector", "raster", "vector"
  "spatial-profile": "tabular-points"
}
    
Resource-level "locations" attribute

  "resources": [
    {
      "path": "data.csv",
      // REQUIRED (to use location information):
      // Array of one or more sources of location information, within the resource.
      // For instance, a resource may contain both point and boundary information,
      // or two kinds of point information.
      "locations": [{
        // REQUIRED: location type specifier, one of:
        // "lat-lon": point defined by two columns
        // "boundary-id": linked boundary defined by column with IDs
        // "geojson": geometry provided as geojson
        "type": "lat-lon",
        // other fields determined by location type
        }
      ]
    }
  ]
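
The following sketch shows how a catalogue or discovery tool might use these two attributes to filter for location-containing Data Packages. It assumes the attribute names and the three location types proposed above. The function names and the sample descriptor are invented for the example.

```python
# Location types proposed in this report.
ALLOWED_LOCATION_TYPES = {"lat-lon", "boundary-id", "geojson"}


def location_types(descriptor):
    """Return the set of location types declared on any resource of a package."""
    found = set()
    for resource in descriptor.get("resources", []):
        for location in resource.get("locations", []):
            loc_type = location.get("type")
            if loc_type not in ALLOWED_LOCATION_TYPES:
                raise ValueError(f"unknown location type: {loc_type!r}")
            found.add(loc_type)
    return found


def has_location_info(descriptor):
    """True if the package declares a spatial profile or any resource location."""
    return "spatial-profile" in descriptor or bool(location_types(descriptor))


# Invented example descriptor (parsed from a datapackage.json).
package = {
    "name": "wildlife-sightings",
    "resources": [
        {"path": "data.csv", "locations": [{"type": "lat-lon"}]},
        {"path": "notes.csv"},
    ],
}
```

Note that has_location_info works even when "spatial-profile" is absent, which reflects the observation above that the package-level attribute can be inferred from the resources.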

Point datasets

Recommendation for creators:

  1. Point datasets SHOULD be published as Tabular Data Package, in CSV, with locations represented as "Latitude" and "Longitude" columns.

  2. These columns SHOULD be given appropriate types when such are supported in Table Schema.

  3. "locations" and "spatial-profile" metadata SHOULD be included, to indicate that the TDP is also a spatially-enriched dataset, and how to interpret the location information.

  4. For maximum reusability, a GeoJSON version of the data should also be included within the Data Package. (NOTE: It cannot be included as a resource without breaking the rules of Tabular Resource, but can be included in the package nonetheless.)

Currently, Table Schema supports "geopoint" and "geojson" location types, which we do not recommend. We propose individual "latitude" and "longitude" types.

Indicative "locations" specification

  "locations": [
    {
      "type": "lat-lon",
      "fields": {
        // REQUIRED: name of fields containing lat/lon (decimal degrees, eg "-37.8").
        //TBD: should they default to "Longitude", "Latitude"?
        "latitude": "lat",
        "longitude": "lon"
      },
      // OPTIONAL:
      // local path to equivalent representation in GeoJSON format, for convenience.
      "geojson-path": "data.geojson",
      // OPTIONAL, controlled vocab (TBD): the relationship of the location to
      // each row, primarily in the case where there is more than one location.
      // For instance, "start" and "end".
      // "role": "start"
      }
    ]
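
As a sketch of how an authoring tool could generate the optional GeoJSON representation from a "lat-lon" location entry, the function below reads CSV text and emits a GeoJSON FeatureCollection using only the Python standard library. It assumes the indicative "locations" structure above. The function name and sample data are invented for the example.

```python
import csv
import io


def rows_to_geojson(csv_text, location):
    """Build a GeoJSON FeatureCollection from CSV text and a "lat-lon" location."""
    if location["type"] != "lat-lon":
        raise ValueError("this sketch only handles 'lat-lon' locations")
    lat_field = location["fields"]["latitude"]
    lon_field = location["fields"]["longitude"]

    features = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        lat = float(row.pop(lat_field))
        lon = float(row.pop(lon_field))
        features.append({
            "type": "Feature",
            # GeoJSON positions are ordered [longitude, latitude].
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            # Remaining columns become properties (strings, as CSV is untyped).
            "properties": dict(row),
        })
    return {"type": "FeatureCollection", "features": features}


# Invented sample data.
csv_text = "species,lat,lon\nMagpie,-37.8,144.9\n"
location = {"type": "lat-lon", "fields": {"latitude": "lat", "longitude": "lon"}}
geojson = rows_to_geojson(csv_text, location)
```

A fuller implementation would use the Table Schema field types to convert the remaining properties, rather than leaving them as strings.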

Line and polygon ("simple vector") datasets

The existing GeoJSON standard serves the need of sharing general-purpose vector data well.

Recommendation for creators:

  1. Simple vector datasets SHOULD be published as a GeoJSON file inside a Data Package.

  2. GeoJSON data SHOULD NOT be included inline in the Data Package descriptor.

  3. A "locations" attribute of type "geojson" SHOULD be included on the resource.

  4. Field-level descriptions MAY be included, using a "schema" attribute in Table Schema format. The location information itself does not count as a "field".

Example of "locations" attribute:


  "locations": [
    {
      "type": "geojson"
    }
  ]

Example of "schema" attribute:


  "schema": {
    "fields": [
      {
        "name": "name",
        "description": "The park's official name."
      },
      {
        "name": "areasqm",
        "description": "The area of the park in square metres."
      }
    ]
  }
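
One use for such a "schema" attribute is checking whether every feature actually carries every described field, which GeoJSON alone cannot express. The sketch below is a hypothetical helper that compares feature properties against the field names in the schema. The sample parks are invented.

```python
def missing_fields(feature_collection, schema):
    """Map feature index to the sorted schema field names absent from it."""
    expected = [field["name"] for field in schema["fields"]]
    report = {}
    for index, feature in enumerate(feature_collection["features"]):
        # "properties" may legitimately be null in GeoJSON.
        properties = feature.get("properties") or {}
        absent = sorted(name for name in expected if name not in properties)
        if absent:
            report[index] = absent
    return report


schema = {
    "fields": [
        {"name": "name", "description": "The park's official name."},
        {"name": "areasqm", "description": "The area of the park in square metres."},
    ]
}

# Invented sample features.
parks = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature",
         "geometry": {"type": "Point", "coordinates": [144.97, -37.82]},
         "properties": {"name": "Example Park", "areasqm": 1200}},
        {"type": "Feature",
         "geometry": {"type": "Point", "coordinates": [144.98, -37.83]},
         "properties": {"name": "Another Park"}},
    ],
}

report = missing_fields(parks, schema)
```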

Tabular data linked to standard boundaries

We can break down the problem of linking tabular data to standard boundaries into two independent questions, following the csv-geo-au model:

  1. How should the tabular data be formatted and described in order to indicate precisely how a given row links to a boundary in the real world?

  2. How should spatial representations of those boundaries be prepared and made linkable, so that correctly prepared tabular datasets can be visualised?

We use the example of a dataset of populations by country. Countries can be unambiguously identified using three-letter codes (ISO 3166-1 alpha-3 (NOTE: https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3)), for instance TUR for Turkey.

Note that a row may have several boundary-linking columns, especially several administrative levels (such as local governments and states).

Preparing boundary-linked tabular data

Recommendation for creators:

  1. Boundary-linked tabular data SHOULD be prepared as Tabular Data Package.

  2. A column containing the boundary IDs SHOULD be named as explicitly as possible (eg "Country ISO 3166-1 alpha-3 code").

  3. A "locations" attribute on the Tabular Resource SHOULD indicate the boundary ID column.

Sample "locations" metadata:


  "locations": [
    {
      // REQUIRED: described above.
      "type": "boundary-id",
      // REQUIRED: the name of the field containing the identifiers.
      "field": "Council ABS ID",
      // REQUIRED: a codelist from a predefined (TBD) set, in hyphenated lower case.
      // Colon (:) indicates a subset which is not always present. Possibilities that make immediate sense:
      // "iso-3166-1:alpha-2": 2-letter country codes
      // "iso-3166-1:alpha-3": 3-letter country codes
      // "iso-3166-2": 4- to 6-character administrative subdivision codes (eg "FR-33")
      // "nuts-1": 1st-level NUTS code for EU (eg "AT3" = Western Austria)
      // "nuts-2": 2nd-level NUTS code for EU (eg "AT33" = Tyrol)
      // "nuts-3": 3rd-level NUTS code for EU (eg "AT332" = Innsbruck)
      // "csv-geo-au": Australian statistical and administrative boundaries as defined by csv-geo-au standard.
      "codelist": "csv-geo-au:lga_code",
      // OPTIONAL: an identifier for the specific version of the boundaries (often a year).
      "version": "2011",
      // OPTIONAL (TBD): local or web path to an actual source of those boundaries, in the absence of a codelist-resolving service.
      // this could also support the (unverified) use case of attributes and boundaries supplied separately in the same DP.
      "geometrypath": "http://..."
    }
  ]

Providing boundaries for linking

It is useful, but not essential, if there exist geospatial datasets corresponding to boundaries being linked to. (Without them, it is still useful to know that a given row contains information about a specific boundary. With them, it could be automatically visualised on a map.)

Defining exactly how this would work is beyond the scope of this report, but an indicative idea:

  • Boundaries published as GeoJSON Data Packages, with metadata indicating which property contains the relevant codelist identifier.

  • A set of identifiers managed, perhaps as a separate Frictionless Data specification (akin to CSV Dialect) which define the values for "codescheme", "codelist" and "version", and link to locations where those boundaries can be obtained.

  • Perhaps services that provide all the boundaries on demand, as vector tiles, for immediate web visualisation.

An example of this kind of mapping is included in TerriaJS’s regionMapping.json configuration file (NOTE: https://github.com/TerriaJS/terriajs/blob/master/wwwroot/data/regionMapping.json).
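
To make the idea concrete, the sketch below joins rows of a boundary-linked table to boundary features by identifier, producing features that could be drawn as a choropleth. It assumes the boundaries are already available as a GeoJSON FeatureCollection and that the caller knows which property holds the codelist identifier. The function name, field names, values and the square placeholder geometry are all invented for the example.

```python
def join_rows_to_boundaries(rows, id_field, boundaries, id_property):
    """Attach each row's values to the boundary feature with the matching ID.

    Returns (features, unmatched_ids).
    """
    by_id = {f["properties"][id_property]: f for f in boundaries["features"]}
    features, unmatched = [], []
    for row in rows:
        boundary = by_id.get(row[id_field])
        if boundary is None:
            unmatched.append(row[id_field])
            continue
        features.append({
            "type": "Feature",
            "geometry": boundary["geometry"],
            # Row values are merged over the boundary's own properties.
            "properties": {**boundary["properties"], **row},
        })
    return features, unmatched


# Invented tabular rows keyed by ISO 3166-1 alpha-3 code.
rows = [{"code": "TUR", "value": "10"}, {"code": "XXX", "value": "5"}]

# Placeholder boundary with a unit-square geometry (not a real border).
boundaries = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature",
         "geometry": {"type": "Polygon",
                      "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]]]},
         "properties": {"iso_a3": "TUR"}},
    ],
}

features, unmatched = join_rows_to_boundaries(rows, "code", boundaries, "iso_a3")
```

Reporting unmatched identifiers matters in practice, since mismatched codes or boundary versions are a likely failure mode.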

Recommendation for Frictionless Data community:

  1. Develop and implement a standard for boundary codelist schemes and identifiers.

  2. Implement a service for resolving such identifiers to mappable geometries.

An alternative approach would be to use Table Schema’s "foreignKeys" (NOTE: https://frictionlessdata.io/specs/table-schema/#foreign-keys) element to link directly to a feature in an external Data Package containing the relevant geometry. This approach has several weaknesses:

  • The "foreignKeys" element does not explicitly support linking to external resources in this way.

  • The tabular dataset would be linked to a specific file representation of the boundaries in question, rather than a semantic abstraction of them. This raises many problems such as establishing equivalency between different copies of the boundaries, and inability to link to boundaries which do not yet exist in a concrete file form.

  • Setting metadata on a column’s "schema" definition wouldn’t work for point datasets (which would have location defined across two columns), so the two kinds of metadata would be inconsistent.

Complex spatial data

Finally, there are the kinds of spatial data which can’t be converted to GeoJSON, either:

  • Because the formats are fundamentally incompatible (such as raster data)

  • Because it is considered impracticable or undesirable to do so for some reason. For instance, when bundling a third party’s data, it may be preferable not to manipulate the files themselves. Or when data in a certain format is intended for a certain community which prefers that format over GeoJSON.

There has already been work done developing a "Geo Data Package" (discussion here (NOTE: https://github.com/frictionlessdata/specs/issues/86), sample outputs here (NOTE: https://github.com/henrykironde/spatial-packages/tree/master/data)). It does not appear to be formally documented.

The sample files demonstrate these attributes for Data Packages containing vector files:

  • driver_name (eg "ESRI Shapefile")

  • extent { xMax, xMin, yMax, yMin }

  • format ("vector")

  • spatial_ref (appears to be an Esri PROJCS[] string)

  • resources

    • geom_type: (eg "Point", "Polygon", "3D Polygon")

And for raster files:

  • driver: "EHdr"

  • datum

  • projection

  • transform

  • resources:

    • band_name

    • no_data_value

    • min

    • max

We would suggest:

  • If "driver_name" is a reference to a GDAL format code (NOTE: That is, a value in the "Code" column here: http://www.gdal.org/ogr_formats.html or http://www.gdal.org/formats_list.html ), then perhaps call it "gdal_format".

  • Group the spatial data attributes together (eg, under a "spatial" object)

  • Include an extent in unprojected (lat-long) coordinates (eg, "extent_latlong")

  • Consider using a non-proprietary projection string, plus an EPSG code if relevant. (All obtainable from spatialreference.org). Perhaps:


  "spatial_ref": {
    "projcs": "...",
    "proj4": "...",
    "epsg": "..."
  }
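
As a sketch of the first two suggestions, the hypothetical helper below takes a descriptor using the flat vector attributes seen in the sample files and returns a copy with those attributes grouped under a "spatial" object, renaming "driver_name" to "gdal_format". The grouping key and the function name are assumptions for illustration. Nothing here is part of an agreed specification.

```python
# Flat attributes observed in the sample vector packages.
SPATIAL_KEYS = ("driver_name", "extent", "format", "spatial_ref")


def group_spatial_attributes(descriptor):
    """Return a copy of descriptor with spatial attributes nested under "spatial"."""
    grouped = {k: v for k, v in descriptor.items() if k not in SPATIAL_KEYS}
    spatial = {k: descriptor[k] for k in SPATIAL_KEYS if k in descriptor}
    # Rename "driver_name" to "gdal_format", per the first suggestion.
    if "driver_name" in spatial:
        spatial["gdal_format"] = spatial.pop("driver_name")
    if spatial:
        grouped["spatial"] = spatial
    return grouped


# Invented example descriptor.
flat = {
    "name": "example-package",
    "driver_name": "ESRI Shapefile",
    "format": "vector",
    "resources": [],
}
nested = group_spatial_attributes(flat)
```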