Raster Sources Using Recipes

Dataset.create_raster_recipe(key, fp, dtype, channel_count, channels_schema=None, sr=None, compute_array=None, merge_arrays=<function concat_arrays>, queue_data_per_primitive=mappingproxy({}), convert_footprint_per_primitive=None, computation_pool='cpu', merge_pool='cpu', resample_pool='cpu', computation_tiles=None, max_computation_size=None, max_resampling_size=None, automatic_remapping=True, debug_observers=())[source]

Warning

This method is not yet implemented. It exists for documentation purposes.

Create a raster recipe and register it under key within this Dataset.

A raster recipe implements the same interfaces as all other rasters, but internally it computes data on the fly by calling a callback. The main goal of raster recipes is to provide a boilerplate-free interface that automates these cumbersome tasks:

  • tiling,

  • parallelism,

  • caching,

  • file reads,

  • resampling,

  • lazy evaluation,

  • backpressure prevention, and

  • optimised task scheduling.

If you are familiar with create_cached_raster_recipe, two parameters are new here: automatic_remapping and max_computation_size.

Parameters

key:

see Dataset.create_raster()

fp:

see Dataset.create_raster()

dtype:

see Dataset.create_raster()

channel_count:

see Dataset.create_raster()

channels_schema:

see Dataset.create_raster()

sr:

see Dataset.create_raster()

compute_array: callable

see Computation Function below

merge_arrays: callable

see Merge Function below

queue_data_per_primitive: dict of hashable (like a string) to a queue_data method pointer

see Primitives below

convert_footprint_per_primitive: None or dict of hashable (like a string) to a callable

see Primitives below

computation_pool:

see Pools below

merge_pool:

see Pools below

resample_pool:

see Pools below

computation_tiles: None or (int, int) or numpy.ndarray of Footprint

see Computation Tiling below

max_computation_size: None or int or (int, int)

see Computation Tiling below

max_resampling_size: None or int or (int, int)

Optionally define a maximum resampling size. If a larger resampling has to be performed, it will be performed tile by tile in parallel.

automatic_remapping: bool

see Automatic Remapping below

debug_observers: sequence of object

Entry points that observe what is happening with this raster in the Dataset’s scheduler.

Returns

source: NocacheRasterRecipe

Computation Function

The function that will map a Footprint to a numpy.ndarray. If queue_data_per_primitive is not empty, it will map a Footprint and primitive arrays to a numpy.ndarray.

It will be called in parallel according to the computation_pool parameter provided at construction.

The function will be called with the following positional parameters:

  • fp: Footprint of shape (Y, X)

    The location at which the pixels should be computed

  • primitive_fps: dict of hashable to Footprint

    For each primitive defined through the queue_data_per_primitive parameter, the input Footprint.

  • primitive_arrays: dict of hashable to numpy.ndarray

    For each primitive defined through the queue_data_per_primitive parameter, the input numpy.ndarray that was automatically computed.

  • raster: CachedRasterRecipe or None

    The Raster object of the ongoing computation.

It should return either:

  • a single ndarray of shape (Y, X) if only one channel was computed
  • a single ndarray of shape (Y, X, C) if one or more channels were computed

If computation_pool points to a process pool, the compute_array function must be picklable and the raster parameter will be None.
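As a minimal sketch, a compute_array callback matching the signature above could look like the following. It assumes no primitives are declared (so both primitive dicts arrive empty), uses only the documented Footprint.shape attribute, and the constant fill value is purely illustrative:

```python
import numpy as np

def compute_array(fp, primitive_fps, primitive_arrays, raster):
    # `fp` is the Footprint at which pixels are requested.
    # One channel is computed here, so an ndarray of shape (Y, X)
    # is returned, as described above.
    return np.full(fp.shape, 42.0, dtype='float32')
```

A function like this is trivially picklable, so it would also be usable with a process-based computation_pool.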

Computation Tiling

You may sometimes want control over the Footprints that are requested from the compute_array function, for example:

  • If the pixels computed by compute_array take a long time to compute, you want to tile to increase parallelism.

  • If the compute_array function scales badly in terms of memory or time, you want to tile to reduce complexity.

  • If compute_array can only work on certain Footprints, you want a hard constraint on the set of Footprints that can be queried from compute_array. (This may happen with convolutional neural networks.)

To do so use the computation_tiles or max_computation_size parameter (not both).

If max_computation_size is provided, a Footprint to be computed will be tiled so that no tile exceeds this size.

If computation_tiles is a numpy.ndarray of Footprint, it should be a tiling of the fp parameter; only the Footprints contained in this tiling will be passed to the compute_array function. If computation_tiles is (int, int), a tiling will be constructed by calling Footprint.tile with those two ints.
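The splitting implied by max_computation_size can be illustrated with plain Python. This is not the scheduler's actual code, only a sketch of how a 1000x1000 request could be cut when max_computation_size is 512:

```python
def tile_1d(length, max_size):
    # Split [0, length) into consecutive intervals of at most max_size
    edges = list(range(0, length, max_size)) + [length]
    return list(zip(edges[:-1], edges[1:]))

# A 1000x1000 request with max size 512 yields 4 tiles:
# 512x512, 512x488, 488x512 and 488x488
tiles = [
    (ys, xs)
    for ys in tile_1d(1000, 512)
    for xs in tile_1d(1000, 512)
]
```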

Merge Function

The function that will map several pairs of Footprint/numpy.ndarray to a single numpy.ndarray. If computation_tiles is None, it will never be called.

It will be called in parallel according to the merge_pool parameter provided at construction.

The function will be called with the following positional parameters:

  • fp: Footprint of shape (Y, X)

    The location at which the pixels should be computed.

  • array_per_fp: dict of Footprint to numpy.ndarray

The Footprint/numpy.ndarray pair of each array that was computed by compute_array and that overlaps with fp.

  • raster: CachedRasterRecipe or None

    The Raster object of the ongoing computation.

It should return either:

  • a single ndarray of shape (Y, X) if only one channel was computed
  • a single ndarray of shape (Y, X, C) if one or more channels were computed

If merge_pool points to a process pool, the merge_arrays function must be picklable and the raster parameter will be None.
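A merge_arrays callback matching the signature above could be sketched as follows. It assumes that Footprint.slice_in(other) returns the numpy slices locating one Footprint inside another (a buzzard Footprint method); everything else is plain numpy, and the zero fill for uncovered pixels is an arbitrary choice:

```python
import numpy as np

def merge_arrays(fp, array_per_fp, raster):
    first = next(iter(array_per_fp.values()))
    # Allocate the output at the requested location, preserving the
    # channel dimension of the computed tiles if there is one
    out = np.zeros(fp.shape + first.shape[2:], dtype=first.dtype)
    for tile_fp, tile_array in array_per_fp.items():
        # Paste each computed tile at its position within `fp`
        out[tile_fp.slice_in(fp)] = tile_array
    return out
```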

Automatic Remapping

When creating a recipe you give a Footprint through the fp parameter. When calling your compute_array function the scheduler will only ask for slices of fp. This means that the scheduler takes care of those boilerplate steps:

  • If you request a Footprint on a different grid in a get_data() call, the scheduler takes care of resampling the outputs of your compute_array function.

  • If you request a Footprint partially or fully outside of the raster’s extent, the scheduler will call your compute_array function to get the interior pixels and then pad the output with nodata.

This system is flexible and can be deactivated by passing automatic_remapping=False to the constructor of a NocacheRasterRecipe. In this case the scheduler will call your compute_array function for any kind of Footprint, so your function must be able to comply with any request.

Primitives

The queue_data_per_primitive and convert_footprint_per_primitive parameters can be used to create dependencies between dependee async rasters and the raster recipe being created. The dependee/dependent relation is called primitive/derived throughout buzzard. A derived recipe can itself be the primitive of another raster. Pipelines of any depth and width can be instantiated that way.

In queue_data_per_primitive you declare a dependee by giving it a key of your choice and the pointer to the queue_data method of the dependee raster. You can parameterize the connection by currying the channels, dst_nodata, interpolation and max_queue_size parameters using functools.partial.

The convert_footprint_per_primitive dict should contain the same keys as queue_data_per_primitive. A value in the dict should be a function that maps a Footprint to another Footprint. It can be used, for example, to request larger rectangles of primitive data when computing a derived array.

e.g. If the primitive raster is an RGB image, and the derived raster only needs the green channel but with a context of 10 additional pixels on all 4 sides:

>>> derived = ds.create_raster_recipe(
...     # <other parameters>
...     queue_data_per_primitive={'green': functools.partial(primitive.queue_data, channels=1)},
...     convert_footprint_per_primitive={'green': lambda fp: fp.dilate(10)},
... )

Pools

The *_pool parameters can be used to select where certain computations occur. Those parameters can be of the following types:

  • A multiprocessing.pool.ThreadPool, should be the default choice.

  • A multiprocessing.pool.Pool, a process pool. Useful for computations that require the GIL or that leak memory.

  • None, to request the scheduler thread to perform the tasks itself. Should be used when the computation is very light.

  • A hashable (like a string), that will map to a pool registered in the Dataset. If that key is missing from the Dataset, a ThreadPool with multiprocessing.cpu_count() workers will be automatically instantiated. When the Dataset is closed, the pools instantiated that way will be joined.
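For instance, an explicit ThreadPool could be built and passed as computation_pool. The pool size here is arbitrary, and this sketch exercises the pool directly rather than through a Dataset:

```python
from multiprocessing.pool import ThreadPool

# A ThreadPool, the default choice per the list above; it could be
# passed as computation_pool, merge_pool or resample_pool.
computation_pool = ThreadPool(4)

# Thread pools accept arbitrary callables (no pickling constraint)
squares = computation_pool.map(lambda x: x * x, range(5))

computation_pool.close()
computation_pool.join()
```

A multiprocessing.pool.Pool could be constructed the same way, but the callables it runs would then have to be picklable, as noted for process pools above.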

See Also

Dataset.create_cached_raster_recipe(key, fp, dtype, channel_count, channels_schema=None, sr=None, compute_array=None, merge_arrays=<function concat_arrays>, cache_dir=None, ow=False, queue_data_per_primitive=mappingproxy({}), convert_footprint_per_primitive=None, computation_pool='cpu', merge_pool='cpu', io_pool='io', resample_pool='cpu', cache_tiles=(512, 512), computation_tiles=None, max_resampling_size=None, debug_observers=())[source]

Create a cached raster recipe and register it under key within this Dataset.

Compared to a NocacheRasterRecipe, in a CachedRasterRecipe the pixels are never computed twice. Cache files are used to store and reuse pixels from computations. The cache can even be reused between python sessions.

If you are familiar with create_raster_recipe, four parameters are new here: io_pool, cache_tiles, cache_dir and ow. They are all related to file system operations.

See the create_raster_recipe method, since it shares most of its features:

>>> help(CachedRasterRecipe)

Parameters

key:

see Dataset.create_raster() method

fp:

see Dataset.create_raster() method

dtype:

see Dataset.create_raster() method

channel_count:

see Dataset.create_raster() method

channels_schema:

see Dataset.create_raster() method

sr:

see Dataset.create_raster() method

compute_array:

see Dataset.create_raster_recipe() method

merge_arrays:

see Dataset.create_raster_recipe() method

cache_dir: str or pathlib.Path

Path to the directory that holds the cache files associated with this raster. If cache files are present, they will be reused (or erased if corrupted). If a cache file is needed and missing, it will be computed.

ow: bool

Overwrite. Whether or not to erase the old cache files contained in cache_dir.

Warning

Not only the tiles needed (hence recomputed), but all buzzard cache files in cache_dir will be deleted.

queue_data_per_primitive:

see Dataset.create_raster_recipe() method

convert_footprint_per_primitive:

see Dataset.create_raster_recipe() method

computation_pool:

see Dataset.create_raster_recipe() method

merge_pool:

see Dataset.create_raster_recipe() method

io_pool:

see Dataset.create_raster_recipe() method

resample_pool:

see Dataset.create_raster_recipe() method

cache_tiles: (int, int) or numpy.ndarray of Footprint

A tiling of the fp parameter; each tile will correspond to one cache file. If (int, int), the tiling is constructed by calling Footprint.tile with this parameter.

computation_tiles:

If None, use the same tiling as cache_tiles; otherwise, see the create_raster_recipe method.

max_resampling_size: None or int or (int, int)

see Dataset.create_raster_recipe() method

debug_observers: sequence of object

see Dataset.create_raster_recipe() method

Returns

source: CachedRasterRecipe

See Also

Dataset.acreate_cached_raster_recipe(fp, dtype, channel_count, channels_schema=None, sr=None, compute_array=None, merge_arrays=<function concat_arrays>, cache_dir=None, ow=False, queue_data_per_primitive=mappingproxy({}), convert_footprint_per_primitive=None, computation_pool='cpu', merge_pool='cpu', io_pool='io', resample_pool='cpu', cache_tiles=(512, 512), computation_tiles=None, max_resampling_size=None, debug_observers=())[source]

Create a cached raster recipe anonymously within this Dataset.

See Dataset.create_cached_raster_recipe
