Dataset

class buzzard.Dataset(sr_work=None, sr_fallback=None, sr_forced=None, analyse_transformation=True, allow_none_geometry=False, allow_interpolation=False, max_active=inf, debug_observers=(), **kwargs)[source]

Dataset is a class that stores references to sources. A source is either a raster, or a vector. A Dataset allows:

  • quick manipulations by optionally assigning a key to each registered source (see Sources Registering below),

  • closing all sources at once by closing the Dataset object.

But also inter-sources operations, like:

  • limiting maximum number of file descriptors,

  • spatial reference harmonization (see On the fly re-projections in buzzard below),

  • workload scheduling on pools when using async rasters (see Scheduler below),

  • other features in the future (like data visualization).

For actions specific to opened sources, see the relevant source classes.

Warning

This class is not equivalent to the gdal.Dataset class.

Parameters

sr_work: None or string

In order to set a spatial reference, use a string that can be converted to WKT by GDAL.

(see On the fly re-projections in buzzard below)

sr_fallback: None or string

In order to set a spatial reference, use a string that can be converted to WKT by GDAL.

(see On the fly re-projections in buzzard below)

sr_forced: None or string

In order to set a spatial reference, use a string that can be converted to WKT by GDAL.

(see On the fly re-projections in buzzard below)

analyse_transformation: bool

Whether or not to perform a basic analysis on two spatial references to check their compatibility.

if True: Read the buzz.env.significant variable and raise an exception if a spatial reference conversion is too lossy in precision.

if False: Skip all checks.

(see On the fly re-projections in buzzard below)

allow_none_geometry: bool

Whether or not reading a vector source should tolerate a None geometry instead of raising an exception

allow_interpolation: bool

Whether or not remapping a raster with interpolation should be allowed instead of raising an exception

max_active: nbr >= 1

Maximum number of pooled sources active at the same time. (see Sources activation / deactivation below)

debug_observers: sequence of object

Entry points to observe what is happening in the Dataset’s scheduler.

Examples

>>> import buzzard as buzz

Creating a Dataset.

>>> ds = buzz.Dataset()

Opening a file and registering it under the ‘roofs’ key. There are four ways to access this source.

>>> r = ds.open_vector('roofs', 'path/to/roofs.shp')
... feature_count = len(ds.roofs)
... feature_count = len(ds['roofs'])
... feature_count = len(ds.get('roofs'))
... feature_count = len(r)

Opening a file anonymously. There is only one way to access that source.

>>> r = ds.aopen_raster('path/to/dem.tif')
... data_type = r.dtype

Opening, reading and closing raster files with context management.

>>> with ds.open_raster('rgb', 'path/to/rgb.tif').close:
...     footprint = ds.rgb.fp
...     arr = ds.rgb.get_data()
>>> with ds.aopen_raster('path/to/rgb.tif').close as rgb:
...     data_type = rgb.dtype
...     arr = rgb.get_data()

Creating files

>>> ds.create_vector('targets', 'path/to/targets.geojson', 'point', driver='GeoJSON')
... geometry_type = ds.targets.type
>>> with ds.acreate_raster('/tmp/cache.tif', ds.dem.fp, 'float32', 1).delete as cache:
...     file_footprint = cache.fp
...     cache.set_data(ds.dem.get_data())

Sources Types

Sources Registering

There are always two ways to create a source, with a key or anonymously.

When creating a source using a key, said key (e.g. the string “my_source_name”) must be provided by the user. Each key identifies one source and must thus be unique. There are then four ways to access that source:

  • using the object returned by the method that created the source,

  • from the Dataset using the attribute syntax: ds.my_source_name,

  • from the Dataset using the item syntax: ds['my_source_name'],

  • from the Dataset using the get method: ds.get('my_source_name').

All keys should be unique.

When creating a source anonymously you don’t have to provide a key, but the only way to access this source is to use the object returned by the method that created the source.

Sources activation / deactivation

The sources that inherit from APooledEmissary (like GDALFileVector and GDALFileRaster) are flexible about their underlying driver object. Those sources may be temporarily deactivated (useful to limit the number of active file descriptors) or activated multiple times at once (useful to perform concurrent reads).

Those sources are automatically activated and deactivated given the current needs and constraints. Setting a max_active lower than np.inf in the Dataset constructor ensures that no more than max_active driver objects are active at the same time, by deactivating the least recently used ones.
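The deactivation policy can be sketched as a small LRU cache (an illustrative model only, not buzzard's internals; every name below is hypothetical):

```python
from collections import OrderedDict

class LRUActivator:
    """Keep at most `max_active` driver objects open; when the cap is
    reached, deactivate (close) the least recently used one."""

    def __init__(self, max_active):
        self._max_active = max_active
        self._active = OrderedDict()  # key -> driver object, oldest first

    def activate(self, key, open_driver):
        if key in self._active:
            # Already active: just mark it as most recently used
            self._active.move_to_end(key)
            return self._active[key]
        if len(self._active) >= self._max_active:
            # Cap reached: deactivate the least recently used driver
            _, lru = self._active.popitem(last=False)
            lru.close()
        self._active[key] = open_driver()
        return self._active[key]
```

Re-activating a source that is already active is cheap: it only moves the entry to the most-recently-used position.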

On the fly re-projections in buzzard

A Dataset may perform spatial reference conversions on the fly, like a GIS does. Several modes are available; a set of rules defines how each mode works. Those conversions concern both read operations and write operations, and all are performed by the OSR library.

Those conversions are only performed on a vector’s data/metadata and on a raster’s Footprint. This implies that classic raster warping is not (yet) included in those conversions; only raster shifting/scaling/rotation are handled.

The z coordinates of vector geometries are also converted; on the other hand, elevations in DEM rasters are not.

If analyse_transformation is set to True (default), all coordinate conversions are tested against buzz.env.significant on file opening to ensure their feasibility, and an exception is raised otherwise. This system is naive and very restrictive; use it with caution. Still, disabling those tests is not recommended: ignoring floating point precision errors can create unpredictable behaviors at the pixel level deep in your code. Such bugs can be witnessed when zooming to infinity with tools like qgis or matplotlib.
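The spirit of that check can be illustrated with a naive round-trip test (a hypothetical sketch, not buzzard's actual analysis; the threshold semantics of buzz.env.significant are simplified here):

```python
def roundtrip_is_precise(fwd, bwd, coords, significant=9):
    """Convert sample coordinates forward and back, and require the
    round-trip error to be negligible at `significant` decimal digits
    relative to the coordinate magnitude."""
    for x in coords:
        err = abs(bwd(fwd(x)) - x)
        scale = max(abs(x), 1.0)
        if err > scale * 10. ** -significant:
            return False
    return True
```

A lossless pair of transforms passes the check, while a transform that truncates coordinates fails it.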

On the fly re-projections in buzzard - Terminology

sr

Spatial reference

sr_work

The sr of all interactions with a Dataset (i.e. Footprints, extents, Polygons…); may be None.

sr_stored

The sr that can be found in the metadata of a raster/vector storage; may be None.

sr_virtual

The sr considered to be written in the metadata of a raster/vector storage, it is often the same as sr_stored. When a raster/vector is read, a conversion is performed from sr_virtual to sr_work. When writing vector data, a conversion is performed from sr_work to sr_virtual.

sr_forced

A sr_virtual provided by the user to ignore all sr_stored. This is for example useful when the sr stored in the input files is corrupted.

sr_fallback

A sr_virtual provided by the user, used when sr_stored is missing. This is for example useful when an input format can’t store a sr (e.g. DXF).

On the fly re-projections in buzzard - Dataset parameters and modes

mode | sr_work | sr_fallback | sr_forced | How the sr_virtual of a source is determined
---- | ------- | ----------- | --------- | --------------------------------------------
1    | None    | None        | None      | Use sr_stored; no conversion is performed for the lifetime of this Dataset
2    | string  | None        | None      | Use sr_stored; raise an exception if it is None
3    | string  | string      | None      | Use sr_stored; if it is None, use sr_fallback instead
4    | string  | None        | string    | Use sr_forced
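The four modes above can be summed up in one decision function (a hypothetical helper for illustration, not part of buzzard's API):

```python
def choose_sr_virtual(sr_stored, sr_work=None, sr_fallback=None, sr_forced=None):
    """Apply the mode rules to pick a source's sr_virtual."""
    if sr_work is None:
        # mode 1: use sr_stored, no conversion is ever performed
        return sr_stored
    if sr_forced is not None:
        # mode 4: ignore sr_stored entirely
        return sr_forced
    if sr_stored is not None:
        # modes 2 and 3: the stored sr is trusted when present
        return sr_stored
    if sr_fallback is not None:
        # mode 3: fall back when the stored sr is missing
        return sr_fallback
    # mode 2: a missing stored sr is an error
    raise ValueError('source has no stored sr and no fallback was provided')
```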

On the fly re-projections in buzzard - Use cases

  • If all opened files are known to be written in the same sr in advance, use mode 1.

    No conversions will be performed, this is the safest way to work.

  • If all opened files are known to be written in the same sr but you wish to work in a different sr, use mode 4.

    The huge benefit of this mode is that the driver specific behaviors concerning spatial references have no impacts on the data you manipulate.

  • On the other hand if you don’t have a priori information on files’ sr, mode 2 or mode 3 should be used.

    Warning

    Since the GeoJSON driver cannot store a sr, it is impossible to open or create a GeoJSON file in mode 2.

On the fly re-projections in buzzard - Examples

mode 1 - No conversions at all

>>> ds = buzz.Dataset()

mode 2 - Working with WGS84 coordinates

>>> ds = buzz.Dataset(
...     sr_work='WGS84',
... )

mode 3 - Working in UTM with DXF files in WGS84 coordinates

>>> ds = buzz.Dataset(
...     sr_work='EPSG:32632',
...     sr_fallback='WGS84',
... )

mode 4 - Working in UTM with unreliable LCC input files

>>> ds = buzz.Dataset(
...     sr_work='EPSG:32632',
...     sr_forced='EPSG:27561',
... )

Scheduler

To handle async rasters living in a Dataset, a thread is spawned to manage requests made to those rasters. It starts as soon as you create an async raster and stops when the Dataset is closed or collected. If one of the callbacks invoked by the scheduler raises an exception, the scheduler stops and the exception is propagated to the main thread as soon as possible.

Thread-safety

Thread safety is one of the main concerns of buzzard. Everything is thread-safe except:

  • The raster write methods

  • The vector write methods

  • The raster read methods when using the GDAL::MEM driver

  • The vector read methods when using the GDAL::Memory driver

Parallel reads of rasters and vectors are natively supported in buzzard.
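Since reads are thread-safe, read requests can be fanned out over a thread pool. A minimal sketch of the pattern (the read_one callable stands in for e.g. raster.get_data(fp=tile); the names here are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

def read_tiles(read_one, tiles, max_workers=4):
    """Read every tile concurrently; results come back in tile order."""
    with ThreadPoolExecutor(max_workers=max_workers) as ex:
        return list(ex.map(read_one, tiles))
```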

__del__()[source]
property close

Close the Dataset with a call or with context management; the close attribute returns an object that can be both called and used in a with statement.

The Dataset can be closed manually or automatically when garbage collected; it is safer to do it manually.

The internal steps are:

  • Stopping the scheduler

  • Joining the mp.Pool instances that were automatically allocated

  • Closing all sources

Examples

>>> ds = buzz.Dataset()
... # code...
... ds.close()
>>> with buzz.Dataset().close as ds:
...     # code...

Caveat

When using a scheduler, some memory leaks may still occur after closing a Dataset.

You can use a debug_observer with an on_object_allocated method to track large objects allocated in the scheduler. It will likely not be the source of the problem. If you ever find a source of leaks, please contact the buzzard team: https://github.com/earthcube-lab/buzzard/issues
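A debug_observer is duck-typed: any object exposing the callbacks you care about, such as the on_object_allocated method mentioned above, can be passed to the Dataset constructor. A hypothetical sketch (the callback's exact signature is an assumption):

```python
class AllocationLogger:
    """Record every allocation event the scheduler reports."""

    def __init__(self):
        self.events = []

    def on_object_allocated(self, **event):
        # Keep the raw event so large allocations can be inspected later
        self.events.append(event)

# ds = buzz.Dataset(debug_observers=[AllocationLogger()])
```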

__getitem__(key)[source]

Retrieve a source from its key

__contains__(item)[source]

Is key or source registered in Dataset

items()[source]

Generate the (key, source) pairs for all proxies

keys()[source]

Generate all source keys

values()[source]

Generate all proxies

__len__()[source]

Retrieve source count registered within this Dataset

property proj4

Dataset’s work spatial reference in WKT proj4. Returns None if mode 1.

property wkt

Dataset’s work spatial reference in WKT format. Returns None if mode 1.

property active_count

Count how many driver objects are currently active

activate_all()[source]

Activate all deactivable proxies. May raise an exception if the number of sources is greater than max_active

deactivate_all()[source]

Deactivate all deactivable proxies. Useful to flush all files to disk

property pools

Get the Pool Container.

>>> help(PoolsContainer)

Pool Container

class buzzard.PoolsContainer[source]

Manages thread/process pools and aliases for a Dataset

alias(key, pool_or_none)[source]

Register the given pool under the given key in this Dataset. The key can then be used to refer to that pool from within the async raster constructors.

Parameters

key: hashable (like a string)
pool_or_none: multiprocessing.pool.Pool or multiprocessing.pool.ThreadPool or None
manage(pool)[source]

Add the given pool to the list of pools that must be terminated upon Dataset closing.

Parameters

pool: multiprocessing.pool.Pool or multiprocessing.pool.ThreadPool
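A usage sketch combining alias and manage (the Dataset calls are commented out because they require an open Dataset; how an async raster constructor refers to the alias is not shown here):

```python
from multiprocessing.pool import ThreadPool

pool = ThreadPool(4)
# ds.pools.alias('io', pool)   # async raster constructors can now use the 'io' key
# ds.pools.manage(pool)        # pool will be terminated when ds closes

# Outside buzzard, the pool behaves like any multiprocessing ThreadPool
squares = pool.map(lambda x: x * x, [1, 2, 3])
pool.close()
pool.join()
```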
__len__()[source]

Number of pools registered in this Dataset

__iter__()[source]

Generator of pools registered in this Dataset

__getitem__(key)[source]

Retrieve a pool (or None) from its alias

__contains__(obj)[source]

Is pool or alias registered in this Dataset

Source Constructors