class buzzard.Dataset(sr_work=None, sr_fallback=None, sr_forced=None, analyse_transformation=True, allow_none_geometry=False, allow_interpolation=False, max_active=inf, debug_observers=(), **kwargs)[source]

Dataset is a class that stores references to sources. A source is either a raster or a vector. A Dataset allows:

  • quick manipulations, by optionally assigning a key to each registered source (see Sources Registering below),

  • closing all sources at once by closing the Dataset object.

It also supports inter-source operations. For actions specific to opened sources, see the source classes:


This class is not equivalent to the gdal.Dataset class.


sr_work: None or string

In order to set a spatial reference, use a string that can be converted to WKT by GDAL.

(see On the fly re-projections in buzzard below)

sr_fallback: None or string

In order to set a spatial reference, use a string that can be converted to WKT by GDAL.

(see On the fly re-projections in buzzard below)

sr_forced: None or string

In order to set a spatial reference, use a string that can be converted to WKT by GDAL.

(see On the fly re-projections in buzzard below)

analyse_transformation: bool

Whether or not to perform a basic analysis on the two sr to check their compatibility.

if True: Read the buzz.env.significant variable and raise an exception if a spatial reference conversion is too lossy in precision.

if False: Skip all checks.

(see On the fly re-projections in buzzard below)

allow_none_geometry: bool

Whether or not a vector source should raise an exception when encountering a None geometry

allow_interpolation: bool

Whether or not a raster source should raise an exception when a remapping requires interpolation.

max_active: nbr >= 1

Maximum number of pooled sources active at the same time. (see Sources activation / deactivation below)

debug_observers: sequence of object

Entry points to observe what is happening in the Dataset’s scheduler.


>>> import buzzard as buzz

Creating a Dataset.

>>> ds = buzz.Dataset()

Opening a file and registering it under the ‘roofs’ key. There are four ways to access an opened source.

>>> r = ds.open_vector('roofs', 'path/to/roofs.shp')
... feature_count = len(ds.roofs)
... feature_count = len(ds['roofs'])
... feature_count = len(ds.get('roofs'))
... feature_count = len(r)

Opening a file anonymously. There is only one way to access the source.

>>> r = ds.aopen_raster('path/to/dem.tif')
... data_type = r.dtype

Opening, reading and closing two raster files with context management.

>>> with ds.open_raster('rgb', 'path/to/rgb.tif').close:
...     fp = ds.rgb.fp
...     arr = ds.rgb.get_data()
>>> with ds.aopen_raster('path/to/rgb.tif').close as rgb:
...     data_type = rgb.dtype
...     arr = rgb.get_data()

Creating two files

>>> ds.create_vector('targets', 'path/to/targets.geojson', 'point', driver='GeoJSON')
... geometry_type = ds.targets.type
>>> with ds.acreate_raster('/tmp/cache.tif', ds.dem.fp, 'float32', 1).delete as cache:
...     file_footprint = cache.fp
...     cache.set_data(ds.dem.get_data())

Sources Types

Sources Registering

There are always two ways to create a source, with a key or anonymously.

When creating a source using a key, said key (e.g. the string “my_source_name”) must be provided by the user. Each key identifies one source and must thus be unique. There are then three ways to access that source:

  • from the object returned by the method that created the source,

  • from the Dataset with the attribute syntax: ds.my_source_name,

  • from the Dataset with the item syntax: ds["my_source_name"].


When creating a source anonymously you don’t have to provide a key, but the only way to access this source is to use the object returned by the method that created the source.

Sources activation / deactivation

The sources that inherit from APooledEmissary (like GDALFileVector and GDALFileRaster) are flexible about their underlying driver object. Those sources may be temporarily deactivated (useful to limit the number of active file descriptors), or activated multiple times simultaneously (useful to perform concurrent reads).

Those sources are automatically activated and deactivated given the current needs and constraints. Setting max_active to a value lower than np.inf in the Dataset constructor ensures that no more than max_active driver objects are active at the same time, by deactivating the least recently used (LRU) ones.
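The LRU deactivation policy can be sketched independently of buzzard. The class below is an illustrative model of the bookkeeping, not buzzard's actual implementation:

```python
from collections import OrderedDict

class LRUPool:
    """Toy model of an LRU activation policy similar in spirit to
    Dataset's max_active bookkeeping (illustrative only)."""

    def __init__(self, max_active):
        self.max_active = max_active
        self._active = OrderedDict()  # key -> driver handle, oldest first

    def activate(self, key):
        if key in self._active:
            # Already active: mark as most recently used
            self._active.move_to_end(key)
            return
        if len(self._active) >= self.max_active:
            # Deactivate the least recently used driver object
            self._active.popitem(last=False)
        self._active[key] = object()  # stands in for a driver handle

pool = LRUPool(max_active=2)
pool.activate('dem')
pool.activate('rgb')
pool.activate('dem')    # 'dem' becomes most recently used
pool.activate('roofs')  # evicts 'rgb', the least recently used
```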

On the fly re-projections in buzzard

A Dataset may perform spatial reference conversions on the fly, like a GIS does. Several modes are available, and a set of rules defines how each mode works. Those conversions concern both read and write operations, and all are performed by the OSR library.

Those conversions are only performed on vector data/metadata and on raster Footprints. This implies that classic raster warping is not (yet) included in those conversions; only raster shifting/scaling/rotation work.

The z coordinates of vector geometries are also converted; on the other hand, elevations in DEM rasters are not converted.

If analyse_transformation is set to True (the default), all coordinate conversions are tested against buzz.env.significant on file opening, to ensure their feasibility or raise an exception otherwise. This system is naive and very restrictive; use it with caution. Still, disabling those tests is not recommended: ignoring floating point precision errors can create unpredictable behaviors at the pixel level deep in your code. Those bugs can be witnessed when zooming to infinity with tools like QGIS or matplotlib.
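The spirit of that feasibility test can be sketched with a naive round-trip check. The helper below is hypothetical (not buzzard's actual code); it converts a coordinate forward and back, and flags the conversion as lossy when the relative error exceeds a number of significant digits:

```python
def roundtrip_is_lossy(xy, forward, backward, significant=9):
    """Hypothetical feasibility check in the spirit of analyse_transformation:
    apply forward then backward and compare the error against a tolerance
    derived from a number of significant digits."""
    x, y = xy
    x2, y2 = backward(*forward(x, y))
    tol = 10 ** -significant
    return (abs(x2 - x) > tol * max(1.0, abs(x))
            or abs(y2 - y) > tol * max(1.0, abs(y)))

# A transformation pair that destroys sub-meter precision is flagged as lossy
lossy = roundtrip_is_lossy(
    (650000.5, 4800000.5),
    lambda x, y: (round(x), round(y)),  # truncates to whole units
    lambda x, y: (x, y),
)
```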

On the fly re-projections in buzzard - Terminology


sr

Spatial reference

sr_work

The sr of all interactions with a Dataset (i.e. Footprints, extents, Polygons…), may be None.

sr_stored

The sr that can be found in the metadata of a raster/vector storage, may be None.

sr_virtual

The sr considered to be written in the metadata of a raster/vector storage, it is often the same as sr_stored. When a raster/vector is read, a conversion is performed from sr_virtual to sr_work. When writing vector data, a conversion is performed from sr_work to sr_virtual.

sr_forced

A sr_virtual provided by the user to ignore all sr_stored. This is for example useful when the sr stored in the input files is corrupted.

sr_fallback

A sr_virtual provided by the user to be used when sr_stored is missing. This is for example useful when an input file can’t store a sr (e.g. DXF).

On the fly re-projections in buzzard - Dataset parameters and modes

mode | sr_work | sr_fallback | sr_forced | How is the sr_virtual of a source determined
-----|---------|-------------|-----------|---------------------------------------------
1    | None    | None        | None      | Use sr_stored, no conversion is performed for the lifetime of this Dataset
2    | string  | None        | None      | Use sr_stored, if None raises an exception
3    | string  | string      | None      | Use sr_stored, if None it is considered to be sr_fallback
4    | string  | None        | string    | Use sr_forced
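The four rules above can be condensed into a small helper. This function is hypothetical (not part of the buzzard API); it only illustrates how sr_virtual is resolved in each mode:

```python
def resolve_sr_virtual(sr_stored, sr_work=None, sr_fallback=None, sr_forced=None):
    """Hypothetical helper: decide which sr a source is considered
    to be stored in, following the mode table above."""
    if sr_forced is not None:
        return sr_forced          # mode 4: ignore sr_stored entirely
    if sr_work is None:
        return sr_stored          # mode 1: no conversion ever performed
    if sr_stored is not None:
        return sr_stored          # modes 2 and 3: trust the file metadata
    if sr_fallback is not None:
        return sr_fallback        # mode 3: fall back when metadata is missing
    raise ValueError('sr_stored is missing and no sr_fallback was given (mode 2)')
```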

On the fly re-projections in buzzard - Use cases

  • If all opened files are known in advance to be written in the same sr, use mode 1.

    No conversions will be performed; this is the safest way to work.

  • If all opened files are known to be written in the same sr but you wish to work in a different one, use mode 4.

    The huge benefit of this mode is that driver-specific behaviors concerning spatial references have no impact on the data you manipulate.

  • On the other hand, if you don’t have a priori information on the files’ sr, mode 2 or mode 3 should be used.

    Side note: Since the GeoJSON driver cannot store a sr, it is impossible to open or create a GeoJSON file in mode 2.

On the fly re-projections in buzzard - Examples

mode 1 - No conversions at all

>>> ds = buzz.Dataset()

mode 2 - Working with WGS84 coordinates

>>> ds = buzz.Dataset(
...     sr_work='WGS84',
... )

mode 3 - Working in UTM with DXF files in WGS84 coordinates

>>> ds = buzz.Dataset(
...     sr_work='EPSG:32632',
...     sr_fallback='WGS84',
... )

mode 4 - Working in UTM with unreliable LCC input files

>>> ds = buzz.Dataset(
...     sr_work='EPSG:32632',
...     sr_forced='EPSG:27561',
... )


To handle async rasters living in a Dataset, a thread is spawned to manage the requests made to those rasters. It starts as soon as you create an async raster and stops when the Dataset is closed or garbage collected. If one of the callbacks invoked by the scheduler raises an exception, the scheduler stops and the exception is propagated to the main thread as soon as possible.


Thread safety is one of the main concerns of buzzard. Everything is thread-safe except:

  • The raster write methods

  • The vector write methods

  • The raster read methods when using the GDAL::MEM driver

  • The vector read methods when using the GDAL::Memory driver
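In practice this means concurrent writers must be serialized by the caller, for example with a lock. A minimal sketch, in which the raster object and write call site are hypothetical:

```python
import threading

# Write methods are not thread-safe, so writes from multiple threads
# must go through a shared lock owned by the application.
write_lock = threading.Lock()

def safe_set_data(raster, array, **kwargs):
    """Serialize writes to a raster shared between threads."""
    with write_lock:
        raster.set_data(array, **kwargs)
```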

property close

Close the Dataset with a call or with context management. The close attribute returns an object that can be both called and used in a with statement.

The Dataset can be closed manually or automatically when garbage collected; it is safer to do it manually.

The internal steps are:

  • Stopping the scheduler

  • Joining the mp.Pool objects that have been automatically allocated

  • Closing all sources


>>> ds = buzz.Dataset()
... # code...
... ds.close()
>>> with buzz.Dataset().close as ds:
...     # code...


When using a scheduler, some memory leaks may still occur after closing a Dataset. Possible origins:

  • A CPython bug (update your python to >= 3.6.7)

  • Gdal cache not flushed (not a leak)

  • The gdal version


  • Some unknown leak in the python threading or multiprocessing standard library

  • Some unknown library leaking memory on the C side

  • Some unknown library storing data in global variables

You can use a debug_observer with an on_object_allocated method to track large objects allocated in the scheduler. It will likely not be the source of the problem. If you ever find a source of leaks, please contact the buzzard team.
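A minimal observer could look like the sketch below. Only the on_object_allocated hook mentioned above is assumed; its exact signature is not guaranteed, so the sketch accepts arbitrary keyword arguments:

```python
class AllocationTracker:
    """Minimal debug observer sketch: records every allocation event
    the scheduler reports via on_object_allocated."""

    def __init__(self):
        self.allocated = []

    def on_object_allocated(self, **kwargs):
        # Keep the raw event payload for later inspection
        self.allocated.append(kwargs)

tracker = AllocationTracker()
# Hypothetical wiring: ds = buzz.Dataset(debug_observers=(tracker,))
```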


Retrieve a source from its key


Is key or source registered in Dataset


Generate the pair of (keys_of_source, source) for all proxies


Generate all source keys


Generate all proxies


Retrieve source count registered within this Dataset

property proj4

Dataset’s work spatial reference in proj4 format. Returns None if mode 1.

property wkt

Dataset’s work spatial reference in WKT format. Returns None if mode 1.

property active_count

Count how many driver objects are currently active


Activate all deactivable proxies. May raise an exception if the number of sources is greater than max_active


Deactivate all deactivable proxies. Useful to flush all files to disk

property pools

Get the Pool Container.

>>> help(PoolsContainer)

Pool Container

class buzzard.PoolsContainer[source]

Manages thread/process pools and aliases for a Dataset

alias(key, pool_or_none)[source]

Register the given pool under the given key in this Dataset. The key can then be used to refer to that pool from within the async raster constructors.


key: hashable (like a string)
pool_or_none: multiprocessing.pool.Pool or multiprocessing.pool.ThreadPool or None
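The aliasing contract can be illustrated with plain-dict bookkeeping. This is a sketch of the behavior, not buzzard's implementation; the key/pool pairs are examples:

```python
from multiprocessing.pool import ThreadPool

# Sketch of the alias registry: a key maps to a pool, or to None.
aliases = {}

def alias(key, pool_or_none):
    """Register a pool (or None) under a hashable key, once."""
    if key in aliases:
        raise ValueError(f'{key!r} is already registered')
    aliases[key] = pool_or_none

alias('io', ThreadPool(4))  # four threads for I/O-bound steps
alias('serial', None)       # None requests serial execution
```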

Add the given pool to the list of pools that must be terminated upon Dataset closing.


pool: multiprocessing.pool.Pool or multiprocessing.pool.ThreadPool

Number of pools registered in this Dataset


Generator of pools registered in this Dataset


Retrieve a pool (or None) from its alias


Is pool or alias registered in this Dataset