What geospatial data is supported?¶
In general, the Python Geo Interface is supported; see https://gist.github.com/sgillies/2217756.
There are three layers of information, modeled after GeoJSON (see https://tools.ietf.org/html/rfc7946):
The outermost layer is a FeatureCollection
Within a FeatureCollection are Features, each of which represents one “thing”, like a building or a river. Each feature has:
type (string; required)
id (object; required)
geometry (Geometry, see below; optional)
bbox (“bounding box”, 4 doubles; optional)
properties (map[string, object]; optional)
Within a Feature is a Geometry, which represents a shape, like a point or a polygon. Each geometry has:
type (one of “Point”, “MultiPoint”, “LineString”, “MultiLineString”, “Polygon”, “MultiPolygon”; required)
coordinates (doubles; exactly how these are structured depends on the type of the geometry)
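As a rough illustration of the three layers described above, the sketch below builds a FeatureCollection from plain Python dictionaries. The id, property name, and coordinates are made-up sample values, not taken from any real dataset.

```python
import json

# Innermost layer: a Geometry (a shape).
geometry = {
    "type": "Point",
    "coordinates": [-71.0589, 42.3601],  # sample x (longitude), y (latitude)
}

# Middle layer: a Feature (one "thing").
feature = {
    "type": "Feature",
    "id": "building-1",                      # hypothetical id
    "geometry": geometry,
    "bbox": [-71.06, 42.36, -71.05, 42.37],  # min x, min y, max x, max y
    "properties": {"name": "Sample building"},
}

# Outermost layer: a FeatureCollection holding the features.
feature_collection = {
    "type": "FeatureCollection",
    "features": [feature],
}

print(json.dumps(feature_collection, indent=2))
```

Any object that exposes a dictionary of this shape through its __geo_interface__ property can be consumed by packages that understand the Python Geo Interface.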
Although the Python Geo Interface is non-prescriptive when it comes to the data types of the id and properties, Tamr has a more restricted set of supported types. See https://docs.tamr.com/reference#attribute-types.
The Dataset class supports the __geo_interface__ property. This will produce one FeatureCollection for the entire dataset.
There is a companion iterator, itergeofeatures(), that returns a generator that allows you to stream the records in the dataset as Geospatial features.
To produce a GeoJSON representation of a dataset:
    import json

    dataset = client.datasets.by_name("my_dataset")
    with open("my_dataset.json", "w") as f:
        json.dump(dataset.__geo_interface__, f)
itergeofeatures() will use the first dataset attribute with geometry type to fill in the feature geometry. You can override this by passing the geometry attribute to use as the geo_attr parameter.
A Dataset can also be updated from a feature collection that supports the Python Geo Interface:
    import geopandas

    geodataframe = geopandas.GeoDataFrame(...)
    dataset = client.datasets.by_name("my_dataset")
    dataset.from_geo_features(geodataframe)
Note that there are currently some limitations to GeoPandas’ implementation of the Geo Interface. See below for more details.
By default the features’ geometries will be placed into the first dataset attribute with geometry type. You can override this by specifying which geometry attribute to use.
Rules for converting from Tamr records to Geospatial Features¶
The record’s primary key will be used as the feature’s id. If the primary key is a single attribute, then the value of that attribute will be the value of id. If the primary key is composed of multiple attributes, then the value of the id will be an array with the values of the key attributes in order.
Tamr allows any number of geometry attributes per record; the Python Geo Interface is limited to one. When converting Tamr records to Python Geo Features, the first geometry attribute in the schema will be used as the geometry; all other geometry attributes will appear as properties with no type conversion. In the future, additional control over the handling of multiple geometries may be provided; the current set of capabilities is intended primarily to support the use case of working with FeatureCollections within Tamr, and FeatureCollection has only one geometry per feature.
An attribute is considered to have geometry type if it has type
RECORD and contains an attribute named
If an attribute named bbox is available, it will be used as bbox. No conversion is done on the value of bbox. In the future, additional control over the handling of bbox attributes may be provided.
All other attributes will be placed in properties, with no type conversion. This includes all geometry attributes other than the first.
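The rules above can be sketched in plain Python. This is an illustrative approximation, not Tamr's implementation: record_to_feature is a hypothetical helper, records are modeled as dictionaries, the caller is assumed to know the geometry attribute names in schema order, and geometry values are assumed to already be Geo Interface geometry dictionaries (which sidesteps how Tamr stores geometries internally).

```python
def record_to_feature(record, key_attrs, geo_attrs):
    """Convert a record (dict) into a Geo Interface feature (dict).

    key_attrs: primary key attribute names, in key order.
    geo_attrs: geometry attribute names, in schema order (an assumption:
               the caller already knows which attributes are geometries).
    """
    # Single-attribute key: use the value itself; composite key: use a list.
    if len(key_attrs) == 1:
        feature_id = record[key_attrs[0]]
    else:
        feature_id = [record[name] for name in key_attrs]

    feature = {"type": "Feature", "id": feature_id}

    # Only the first geometry attribute becomes the feature geometry.
    first_geo = geo_attrs[0] if geo_attrs else None
    if first_geo is not None and record.get(first_geo) is not None:
        feature["geometry"] = record[first_geo]

    # bbox is passed through with no conversion.
    if record.get("bbox") is not None:
        feature["bbox"] = record["bbox"]

    # Everything else, including later geometry attributes, goes to properties.
    reserved = set(key_attrs) | {"bbox", first_geo}
    properties = {k: v for k, v in record.items() if k not in reserved}
    if properties:
        feature["properties"] = properties
    return feature


# Sample record with two geometry attributes (made-up values).
record = {
    "id": "r1",
    "location": {"type": "Point", "coordinates": [1.0, 2.0]},
    "footprint": {"type": "Point", "coordinates": [3.0, 4.0]},
    "bbox": [0.0, 0.0, 5.0, 5.0],
    "name": "Sample",
}
feature = record_to_feature(record, ["id"], ["location", "footprint"])
print(feature)
```

In this sample, location becomes the feature geometry, while footprint is left untouched in properties alongside name.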
Rules for converting from Geospatial Features to Tamr records¶
The Feature’s id will be converted into the primary key for the record. If the record uses a simple key, no value translation will be done. If the record uses a composite key, then the value of the Feature’s id must be an array of values, one per attribute in the key.
If the Feature contains keys in properties that conflict with the record keys, bbox, or geometry, those keys are ignored (omitted).
If the Feature contains a bbox, it is copied to the record.
All other keys in the Feature’s properties are propagated to the same-name attribute on the record, with no type conversion.
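The reverse direction can be sketched the same way. Again, this is only an approximation under stated assumptions: feature_to_record is a hypothetical helper, the target geometry attribute name is supplied by the caller, the bbox is assumed to land in an attribute named bbox, and the geometry dictionary is stored as-is rather than in Tamr's internal representation.

```python
def feature_to_record(feature, key_attrs, geo_attr):
    """Convert a Geo Interface feature (dict) into a record (dict).

    key_attrs: primary key attribute names, in key order.
    geo_attr:  name of the attribute that receives the geometry (assumed
               to be chosen by the caller).
    """
    record = {}

    # Properties that collide with the key, bbox, or geometry are omitted.
    reserved = set(key_attrs) | {"bbox", geo_attr}
    for name, value in (feature.get("properties") or {}).items():
        if name not in reserved:
            record[name] = value

    # Simple key: the id is used untranslated.
    # Composite key: the id must hold one value per key attribute.
    if len(key_attrs) == 1:
        record[key_attrs[0]] = feature["id"]
    else:
        if len(feature["id"]) != len(key_attrs):
            raise ValueError("id must have one value per key attribute")
        record.update(zip(key_attrs, feature["id"]))

    if feature.get("geometry") is not None:
        record[geo_attr] = feature["geometry"]
    if feature.get("bbox") is not None:
        record["bbox"] = feature["bbox"]
    return record


# Sample feature whose properties collide with reserved names (made-up values).
sample = {
    "type": "Feature",
    "id": "r1",
    "geometry": {"type": "Point", "coordinates": [1.0, 2.0]},
    "bbox": [0.0, 0.0, 1.0, 1.0],
    "properties": {"id": "ignored", "bbox": "ignored", "name": "Sample"},
}
result = feature_to_record(sample, ["id"], "location")
print(result)
```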
Streaming data access¶
itergeofeatures() returns a generator that allows you to stream the records in the dataset as Geospatial features:
    my_dataset = client.datasets.by_name("my_dataset")
    for feature in my_dataset.itergeofeatures():
        do_something(feature)
Note that many packages that consume the Python Geo Interface will be able to consume this iterator directly. For example:
    from geopandas import GeoDataFrame

    df = GeoDataFrame.from_features(my_dataset.itergeofeatures())
This allows construction of a GeoDataFrame directly from the stream of records, without materializing the intermediate dataset.
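To illustrate why streaming matters, the sketch below writes features to a line-delimited JSON stream one at a time, so only a single feature is held in memory at once. fake_itergeofeatures is a hypothetical stand-in for dataset.itergeofeatures() so the example runs without a Tamr connection; write_features is likewise an illustrative helper, not part of any library.

```python
import io
import json


def fake_itergeofeatures(n):
    """Stand-in for dataset.itergeofeatures(); lazily yields n sample features."""
    for i in range(n):
        yield {
            "type": "Feature",
            "id": str(i),
            "geometry": {"type": "Point", "coordinates": [float(i), float(i)]},
            "properties": {"index": i},
        }


def write_features(features, out):
    """Write one JSON document per line, pulling features from the iterator."""
    count = 0
    for feature in features:
        out.write(json.dumps(feature) + "\n")
        count += 1
    return count


# An in-memory buffer is used here; a real file object would work the same way.
buffer = io.StringIO()
written = write_features(fake_itergeofeatures(3), buffer)
print(written, "features written")
```

With a real dataset, the generator returned by itergeofeatures() could presumably be passed to write_features in place of the stand-in.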
Note on GeoPandas data access¶
There is a current limitation in GeoPandas that causes the feature’s ID field to be ignored in certain scenarios. The Tamr primary key is stored in this field.
The result is that when loading data and updating records through the dataset.from_geo_features() method, records will not be overwritten as anticipated.
This issue can be circumvented by re-inserting the id field into the data when loading features into GeoPandas.
    import geopandas as gpd

    my_dataset = client.datasets.by_name("my_dataset")
    for feature in my_dataset.itergeofeatures():
        primary_key = feature['id']
        df = gpd.GeoDataFrame.from_features([feature])
        do_something(df)
        # Restore the primary key that GeoPandas ignored.
        df.index = [primary_key]
        my_dataset.from_geo_features(df)
Alternatively, it is possible to load the full dataset as follows:
    import geopandas as gpd

    my_dataset = client.datasets.by_name("my_dataset")

    def geopandas_dataset(dataset):
        for feature in dataset.itergeofeatures():
            # Copy the id into properties so GeoPandas keeps it.
            feature['properties']['primary_key'] = feature['id']
            yield feature

    df = gpd.GeoDataFrame.from_features(geopandas_dataset(my_dataset))
    # set_index returns a new frame, so the result must be reassigned.
    df = df.set_index('primary_key')
    do_something(df)
    my_dataset.from_geo_features(df)