Reference Manual and Format Specification

Version 1.0

Author:
Damian Eads

License

Copyright (C) 2004-2006 The Regents of the University of California.

Copyright (C) 2007 Los Alamos National Security, LLC.

This material was produced under U.S. Government contract DE-AC52-06NA25396 for Los Alamos National Laboratory (LANL), which is operated by Los Alamos National Security, LLC for the U.S. Department of Energy. The U.S. Government has rights to use, reproduce, and distribute this software. NEITHER THE GOVERNMENT NOR LOS ALAMOS NATIONAL SECURITY, LLC MAKES ANY WARRANTY, EXPRESS OR IMPLIED, OR ASSUMES ANY LIABILITY FOR THE USE OF THIS SOFTWARE. If software is modified to produce derivative works, such modified software should be clearly marked, so as not to confuse it with the version available from LANL.

Additionally, this library is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 2.1 of the License, or (at your option) any later version. Accordingly, this library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

Los Alamos Computer Code LA-CC-06-105

Installation and Compilation

The SIF library does not depend on any libraries other than libc. The library has been compiled on only a few platforms: 32-bit machines running Visual Studio(R) on Windows 2000(R) and Windows XP(R), and GNU/Linux with large file support (64-bit) turned on. In theory, SIF should compile on platforms that do not support 64-bit files, but this has not been tested. SIF has not been tested on Cygwin.

Building SIF on Windows

Open up the solution file sif.sol and click Build...Compile. This should build the sif.dll file, which you may put in a folder that is accessible by your PATH variable.

Installing SIF on Windows from Binaries

Download one of the binaries from the SIF website, http://public.lanl.gov/eads/sif.

Building SIF on Linux

We use the autotools, specifically automake and autoconf. First, your system needs to be inspected by running the configure script:

  ./configure

Configuring without any options installs the library and include files to /usr/local by default. To change the prefix, do

  ./configure --prefix=/desired/prefix

After configuring your build environment, you are ready to build the library. The first command below builds the library while the second command installs it and the include files.

  make
  make install

Building against SIF in Linux

The compilation of your own code against SIF is not covered in this document since it's a topic that others have certainly covered better than I could. However, there are two helpful points to note. SIF builds a shared library, libsif.so, which it puts in prefix/lib, and it puts a header file, sif-io.h, in prefix/include. You will need to include sif-io.h to have access to SIF functions, and you will need to link against libsif.so. We provide a pkg-config file for your convenience called sif-io.pc, which is put in prefix/lib/pkgconfig. If you wish to use it, make sure your PKG_CONFIG_PATH includes that directory. Be sure your C_INCLUDE_PATH is set to include prefix/include and your LD_LIBRARY_PATH includes prefix/lib.

Introduction

The Sparse Image Format (SIF) is a file format for storing raster images with sparse pixel data. Images are broken down into a grid of tiles of fixed size. A tile is only stored in a file if any two pixels in it are different. This is particularly useful for images that are highly homogeneous in color. A few applications of SIF include using it to store:

The SIF format is not intended for space-efficiently storing multispectral imagery or photos.

Basic Terminology

It is helpful to review some terminology used throughout this document since these terms are essential to understanding the basic operation of the library. Images are made up of bands and these bands are made up of pixels or data units. For example, in an image of 32-bit floats with three bands, the data unit is a 32-bit float and the data unit size is 4 bytes. A tile is a fixed-size cuboid of an image including all of its bands. A slice is a single band of a tile. The next figure illustrates all these primitive elements together.

slab.png

Every tile has a small tile header describing it: its uniformity and its byte offset location on disk (if applicable). A block refers to a unit of space on disk for storing a tile. The terms are different to easily differentiate the physical entity of an image (a tile) from the storage space used to store it (a block). A block need not contain a tile, in which case it is an unused block, and it can be reclaimed for later use. The next table shows how all of these image elements (tiles, blocks, slices) are related to one another and their relative sizes.

Relevant to       Unit              Size
Image             image_width       User-defined
Image             image_height      User-defined
Image/Tile/Block  bands             User-defined
Image             image_wpixels     ceil(image_width / tile_width) * tile_width
Image             image_hpixels     ceil(image_height / tile_height) * tile_height
Image             image_data_units  image_wpixels * image_hpixels * bands
Image             image_bytes       image_data_units * data_unit_size
Tile/Slice/Block  tile_width        User-defined
Tile/Slice/Block  tile_height       User-defined
Tile              tile_data_units   tile_width * tile_height * bands
Tile              tile_bytes        tile_data_units * data_unit_size
Block             block_data_units  tile_width * tile_height * bands
Block             block_bytes       block_data_units * data_unit_size
Slice             slice_data_units  tile_width * tile_height * 1
Slice             slice_bytes       slice_data_units * data_unit_size
Data Unit         data_unit_size    User-defined
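
The size relationships in the table can be sketched in a few lines of C. This is only an illustration: the struct sif_dims and the function names below are hypothetical and are not part of sif-io.h.

```c
/* Hypothetical container for the user-defined quantities in the table. */
struct sif_dims {
    long image_width, image_height; /* in pixels */
    long bands;
    long tile_width, tile_height;   /* in pixels */
    long data_unit_size;            /* in bytes */
};

/* Integer ceiling of a / b for positive a and b. */
long ceil_div(long a, long b) { return (a + b - 1) / b; }

/* Image width rounded up to a whole number of tiles. */
long image_wpixels(const struct sif_dims *d) {
    return ceil_div(d->image_width, d->tile_width) * d->tile_width;
}

/* Image height rounded up to a whole number of tiles. */
long image_hpixels(const struct sif_dims *d) {
    return ceil_div(d->image_height, d->tile_height) * d->tile_height;
}

/* Bytes needed for the padded image, all bands included. */
long image_bytes(const struct sif_dims *d) {
    return image_wpixels(d) * image_hpixels(d) * d->bands * d->data_unit_size;
}

/* Bytes needed for one tile, all bands included. */
long tile_bytes(const struct sif_dims *d) {
    return d->tile_width * d->tile_height * d->bands * d->data_unit_size;
}

/* Bytes needed for one slice (a single band of a tile). */
long slice_bytes(const struct sif_dims *d) {
    return d->tile_width * d->tile_height * d->data_unit_size;
}
```

For a 1000 x 600 image of three bands of 4-byte data units with 256 x 256 tiles, the padded image is 1024 x 768 pixels and each tile occupies 786432 bytes.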

Uniformity and Compression

Compression in the SIF format is quite simple. A slice is uniform if every data unit in the slice is the same and the common pixel value for the slice is called its uniform pixel value. A tile is uniform if all of its slices are uniform. Slices within the same tile need not have the same uniform pixel value.
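
The definitions above can be sketched as a byte-wise comparison. This is an illustration rather than the library's implementation: the function names are hypothetical, and the tile buffer is assumed to hold its slices one after another (band index slowest).

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if every data unit in the slice has the same bytes as the
   first data unit, and 0 otherwise. */
int slice_is_uniform(const unsigned char *slice, size_t n_units,
                     size_t unit_size) {
    size_t i;
    for (i = 1; i < n_units; i++) {
        if (memcmp(slice, slice + i * unit_size, unit_size) != 0) {
            return 0;
        }
    }
    return 1;
}

/* Returns 1 if all slices of the tile are uniform. Slices may have
   different uniform pixel values. Assumes slices are stored back to back. */
int tile_is_uniform(const unsigned char *tile, size_t bands,
                    size_t units_per_slice, size_t unit_size) {
    size_t b;
    for (b = 0; b < bands; b++) {
        const unsigned char *slice = tile + b * units_per_slice * unit_size;
        if (!slice_is_uniform(slice, units_per_slice, unit_size)) {
            return 0;
        }
    }
    return 1;
}
```

A two-band tile whose first slice is all 1s and whose second slice is all 2s is uniform under this definition, even though the two slices differ from each other.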

The next figure shows an example of an image, tile headers for a few tiles, and the layout of various elements in the file. The red and white blocks correspond to used blocks and unused blocks, respectively. Unused blocks occur when a previously used block is freed up. Let's examine five tiles (i, j, k, m, and n), their associated tile headers, and their placement on disk.

disklayout.png

Types of uniformity

There are three different kinds of uniformity, which are described below.

Some functions only check for shallow uniformity when performing their operations while others consider intrinsic uniformity. The sif_consolidate function helps free up data blocks by checking for uniformity of underlying block rasters, labeling them as shallow uniform if it finds them to be intrinsically uniform, and freeing up the disk blocks used by them. The sif_consolidate function also reduces external fragmentation by moving the used blocks to the front of the file and the unused blocks to the back. If the sif_header::consolidate flag is set, this consolidation process is performed on the file's closing.

Border tiles that overlap the image boundary

Sometimes the tile width does not divide the image width or the tile height does not divide the image height. The next figure illustrates this. The border tile overlaps the border of the image. In cases such as these, only tile pixels within the image boundary are examined for uniformity. Border tiles cause some internal fragmentation, but it is often negligible.

bordertile.png
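
As a rough sketch of the border case, the number of tile columns (or rows) that fall inside the image can be computed as follows. The function name valid_extent is hypothetical and tile indices are assumed to start at zero.

```c
/* Number of pixels of tile t, along one axis, that lie inside the image.
   image_size and tile_size are both measured along that same axis. */
long valid_extent(long image_size, long tile_size, long t) {
    long start = t * tile_size;           /* first pixel covered by tile t */
    long remaining = image_size - start;  /* image pixels from start onward */
    if (remaining <= 0) {
        return 0;                         /* tile lies wholly outside */
    }
    return remaining < tile_size ? remaining : tile_size;
}
```

With an image width of 1000 and a tile width of 256, the last tile column (index 3) has only 232 columns inside the image, so only those would be examined for uniformity.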

Choosing tile dimensions

The choice of tile width and tile height depends on several factors. One is the number of bytes needed to store the tile header relative to the bytes needed to store the tile raster. It is also useful to characterize the kind of uniformity expected in the image, such as the size of the largest non-uniform region and the number of such regions.

Meta-data

The SIF format facilitates the storage of (key, value)-paired meta-data. A (key, value) pair is called a meta-data item. A meta-data item is referred to by its key and the element data is its value. Keys are case-sensitive.

SIF has two types of meta-data values, strings and binary byte sequences. String values must be represented by null-terminated character arrays. Binary byte sequences permit the storage of arbitrary data in a SIF file. String values are stored as binary byte sequences. If an attempt is made to retrieve a meta-data item as a string but the value is not a null-terminated byte sequence, an error is returned.

Reserved meta-data

Any meta-data key beginning with _sif_ is reserved for special meta-data, as defined by the SIF file format specification. The following reserved meta-data keys are currently in usage:

The following example sets a meta-data field on the SIF file referred to by the sif_file pointer named file:

  sif_set_meta_data(file, "model_file", "/afs/clue/gadm/833/hyper.model");

Now, let's retrieve it and print it:

  printf("model file: %s\n", sif_get_meta_data(file, "model_file"));

Now let's try to store an array V of 32 doubles using native byte order,

  sif_set_meta_binary(file, "my_32_doubles", V, sizeof(double) * 32);

Let's now retrieve it and print it out.

  double *buf;
  int nbytes, i;
  buf = sif_get_meta_data_binary(file, "my_32_doubles", &nbytes);
  if (nbytes != sizeof(double) * 32) {
    printf("Something bad happened.\n");
  }
  else {
    for (i = 0; i < 32; i++) {
      printf("my_double %d: %5.8f\n", i, buf[i]);
    }
  }

Caveat

Since a copy of all the meta-data in a SIF file is always stored in memory, the meta-data feature is only intended for light use. In a future version, storage of a large meta-data footprint will be viable.

Pixel Data Types

The SIF file format does not establish a set of data types. The underlying pixel values are data-typeless to the SIF I/O library. The library permits the user to store a sif_header::user_data_type field, but the field's value does not influence the behavior of SIF routines. This scheme of ignoring the underlying data type works if it can be guaranteed that two pixel values p(x1,y1,b1) and p(x2,y2,b2) are the same if and only if the underlying byte sequences of these values are the same. Thus, SIF only needs to know the size of each underlying byte sequence that represents a single pixel value.
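
The byte-sequence condition matters for some data types. As a small illustration (not taken from the library; same_bytes is a hypothetical helper), IEEE-754 positive and negative zero compare equal as numbers but have different byte representations, so a byte-wise comparison treats them as different pixel values.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if the two n-byte values have identical byte sequences. */
int same_bytes(const void *a, const void *b, size_t n) {
    return memcmp(a, b, n) == 0;
}
```

On a platform with IEEE-754 floats, 0.0f == -0.0f is true while same_bytes on the two values returns 0.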

Agreement Meta-data String

Agreeing to a data type convention provides a guarantee that the data type of pixels can easily be determined. We encourage users to set the "_sif_agree" meta-data value to a string indicating the data type convention used. If the "_sif_agree" meta-data string is not set, no guarantees can be made about the data type of the pixels. Alternatively, the sif_get_agreement and sif_set_agreement functions can be used to get and set the agreement string.

We define one data type convention, "simple". The type codes for the sif_header::user_data_type field are defined below.

Value of user_data_type   Corresponding Data Type
0 or SIF_SIMPLE_UINT8     unsigned char or uint8_t (little-endian)
1 or SIF_SIMPLE_INT8      char or int8_t
2 or SIF_SIMPLE_UINT16    uint16_t
3 or SIF_SIMPLE_INT16     int16_t
4 or SIF_SIMPLE_UINT32    uint32_t
5 or SIF_SIMPLE_INT32     int32_t
6 or SIF_SIMPLE_UINT64    uint64_t
7 or SIF_SIMPLE_INT64     int64_t
8 or SIF_SIMPLE_FLOAT32   IEEE-754 32-bit float
9 or SIF_SIMPLE_FLOAT64   IEEE-754 64-bit float

For example, suppose we've created a new SIF file with a data_unit_size of 4. Now let's write some code to indicate the "simple" convention: use unsigned 32-bit integers as the data type and use big-endian as the byte-order of the data units. Then, we'll print out these codes using functions that manipulate the compound type code (which we store in the sif_header::user_data_type in the file's header) to give the base type code (i.e. the data type irrespective of the byte order) and the endian code.

  int base_code = SIF_SIMPLE_UINT32;
  int endian_code = SIF_SIMPLE_BIG_ENDIAN;
  int compound_code = SIF_SIMPLE_TYPE_CODE(base_code, endian_code);
  sif_set_agreement(file, SIF_AGREEMENT_SIMPLE);
  sif_set_user_data_type(file, compound_code);
  printf("Base Data Type: %d\n"
         "Compound Data Type: %d\n"
         "Endian: %d\n", SIF_SIMPLE_BASE_TYPE_CODE(compound_code),
         compound_code, SIF_SIMPLE_ENDIAN(compound_code));

Alternatively, you can use the sif_simple_create function to create a file using the "simple" data type convention.

SIF File Layout

The SIF file begins with a fixed-size header followed by sif_header::n_tiles fixed-sized tile headers, followed by a variable number of fixed-sized blocks. Finally, the meta-data is written after all the data blocks. The file header and tile headers are put in the beginning of the file since their size does not change, although their values may change. This means that the large data blocks need not be moved forward in the file. Meta-data is written after the data blocks since the number of meta-data items can change; thus, the approach eliminates the need to move up data blocks after inexpensive meta-data operations. Unfortunately, there is a danger that after a new data block is allocated, there may not be enough space on the partition for the meta-data, and the file's meta-data cannot be safely written as read. This kind of data loss is uncommon if efforts are made to ensure adequate disk space is available to the scientific programs that use SIF.

Overall Layout of a SIF File.

The overall layout of a SIF file from the first byte to the last is shown in the following table. The file header remains of constant size and most header fields are immutable. The tile headers are of constant size since the data unit size and number of bands are immutable quantities. The inclusion of routines for changing tile dimensions, image dimensions, endianness, and data types is currently under consideration. The block region is of variable size and precedes the meta-data region. Consequently, the location of the meta-data region changes as the size of the block region changes. The size of the meta-data region changes as meta-data fields are modified, added, or removed. It was anticipated that meta-data would be modified less frequently than the data blocks. If the meta-data were to precede the block region, a small increase in the size of the meta-data region would result in the need to move the entire block region, which is costly when the block region is large. This is a casual justification for our choice of storing the meta-data after the block region. Also under consideration is the ability to store the meta-data before the block region, employing preallocation strategies to minimize moves of the block region due to meta-data region resizing.

File Header
Tile Header 1 (starts at header->header_bytes)
Tile Header 2
...
Tile Header n_tiles (starts at file->base_location)
Block 1
Block 2
...
Block n_blocks
Meta-data Item 1 (starts after the last byte of the last block)
Meta-data Item 2
...
Meta-data Item n_meta_data_items
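
The layout suggests how byte offsets could be derived from header fields. The sketch below is an assumption-laden illustration, not the library's code: it assumes the regions are contiguous, that tile headers and blocks are indexed from zero, and that each block occupies tile_bytes bytes. The function names are hypothetical.

```c
#include <stdint.h>

/* Offset of tile header number tile_index (zero-based, assumed). */
int64_t tile_header_offset(int32_t header_bytes, int32_t tile_hd_bytes,
                           int32_t tile_index) {
    return (int64_t)header_bytes + (int64_t)tile_index * tile_hd_bytes;
}

/* Offset of the first block, assuming the block region starts
   immediately after the last tile header. */
int64_t block_region_offset(int32_t header_bytes, int32_t n_tiles,
                            int32_t tile_hd_bytes) {
    return (int64_t)header_bytes + (int64_t)n_tiles * tile_hd_bytes;
}

/* Offset of block number block_num (zero-based, assumed), where each
   block is assumed to be tile_bytes long. */
int64_t block_offset(int32_t header_bytes, int32_t n_tiles,
                     int32_t tile_hd_bytes, int32_t tile_bytes,
                     int32_t block_num) {
    return block_region_offset(header_bytes, n_tiles, tile_hd_bytes)
           + (int64_t)block_num * tile_bytes;
}
```

The 64-bit result type reflects the large file support mentioned earlier, since block offsets can exceed the 32-bit range.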

File Header Byte Layout

The absolute byte offset for each file header field is shown in the next table. The second column is the name of the field as it appears in the sif_header struct populated when a SIF file is opened. Integers and doubles are signed and stored in big-endian (or network) byte order. Note that in SIF Format Version code 1, doubles were stored little-endian, but we realized this was confusing, so this has been changed to big-endian in versions 2 and higher. The header_bytes field enables the format to be changed without advancing the version code. Specifically, non-essential header fields can be added, but they will be ignored by earlier versions of the I/O library.

The only fields that can change following the first write of a raster to an image are the defragmentation, consolidation, and intrinsic write flags as well as the georeferencing transform, and key count. If the caller wishes to change the image dimensions, data type, or tile dimensions after the first raster write, it must be done manually.

Absolute Offset Name Description Type
0 header_bytes The header size in bytes including the space needed for header_bytes. 32-bit int (b.e.)
4 magic_number The magic number "!**SIF**". 8 8-bit chars
12 version The version of the SIF file format used for the target file. This field is not the version of the SIF I/O library used to write the file. 32-bit int
16 width The width of the image in pixels. 32-bit int
20 height The height of the image in pixels. 32-bit int
24 bands The depth of the image in pixels. 32-bit int
28 n_keys The number of meta data fields stored. 32-bit int
32 n_tiles The number of tiles stored. 32-bit int
36 tile_width The width of each tile and slice in pixels. 32-bit int
40 tile_height The height of each tile and slice in pixels. 32-bit int
44 tile_bytes The number of bytes to store a single tile with all bands. 32-bit int
48 n_tiles_across The number of tiles for a single row of tiles on an image. 32-bit int
52 data_unit_size The size of a single data unit. 32-bit int
56 user_data_type A user-defined constant to represent the data type of the pixels, meaningful to the caller. 32-bit int
60 defragment When set, defragments the file during close. 32-bit int
64 consolidate When set, consolidates the file during close. 32-bit int
68 intrinsic_write When set, newly dirtied tiles are checked for intrinsic uniformity when written. 32-bit int
72 tile_hd_bytes The number of bytes to store a single tile header on disk. 32-bit int
76 n_unif_flags The number of bytes to store the uniform flags in the tile header. 32-bit int
80 aff_geo_trans The affine geo-referencing transform. Six 64-bit IEEE-754 doubles (b.e.)
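
As an illustration of the table, the first fields can be decoded from a raw header buffer as sketched below. The helper names read_be32 and has_sif_magic are hypothetical; the library's own routines may work differently.

```c
#include <stdint.h>
#include <string.h>

/* Decodes a signed 32-bit big-endian integer from four bytes. */
int32_t read_be32(const unsigned char *p) {
    uint32_t v = ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16)
               | ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
    return (int32_t)v;
}

/* Returns 1 if the eight bytes at absolute offset 4 hold the magic number. */
int has_sif_magic(const unsigned char *header) {
    return memcmp(header + 4, "!**SIF**", 8) == 0;
}
```

Given the first 16 bytes of a file in a buffer h, read_be32(h) yields header_bytes and read_be32(h + 12) yields the format version code.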

Tile Header Byte Layout

The tile headers store information about the uniformity or non-uniformity of each tile. If the tile is uniform, the uniform_pixel_value fields have meaning, and the i'th value is the uniform pixel value for the i'th slice. The block_num field is set to -1 if the tile header corresponds to a uniform tile. The first advancing index is the horizontal tile index and the second, the vertical tile index. This corresponds to how tiles are read and written from and to buffers in the image, i.e. the x coordinate of the pixels advances before the y.

Relative Offset (to the previous unit) Name Description Type
0 uniform_pixel_value[0] The value to fill band 0 if the tile is uniform. Otherwise, the value is meaningless User defined (of size data_unit_size)
data_unit_size uniform_pixel_value[1] The value to fill band 1 if the tile is uniform. Otherwise, the value is meaningless User defined (of size data_unit_size)
... ... ... ...
i*data_unit_size uniform_pixel_value[i] The value to fill band i if the tile is uniform. Otherwise, the value is meaningless User defined (of size data_unit_size)
... ... ... ...
(bands - 1) * data_unit_size uniform_pixel_value[bands - 1] The value to fill the last band if the tile is uniform. Otherwise, the value is meaningless User defined (of size data_unit_size)
r=bands * data_unit_size uniform_flags An array of bits, the i'th bit is TRUE if the i'th slice is uniform. h=ceil(bands/8) 8-bit characters
r+h block_num The block number where this tile is stored if it is non-uniform. This value is -1 if uniform. 32-bit int
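
Summing the rows of the table gives the on-disk size of one tile header. The sketch below uses hypothetical function names and is only meant to make the arithmetic explicit.

```c
/* Bytes needed for the uniform flags: one bit per band, rounded up. */
long n_uniform_flag_bytes(long bands) {
    return (bands + 7) / 8;
}

/* Bytes for one tile header: the uniform pixel values, the uniform
   flags, and the 32-bit block_num field. */
long tile_header_bytes(long bands, long data_unit_size) {
    return bands * data_unit_size + n_uniform_flag_bytes(bands) + 4;
}
```

For three bands of 4-byte data units, this gives 12 + 1 + 4 = 17 bytes per tile header.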

Meta-Data Item Byte Layout

The meta-data item byte layout is simple. Again, integer length fields are assumed to be big-endian.

Relative Offset (to the previous unit) Name Description Type
0 key_length The number of bytes to store the key including the null terminator. 32-bit int (b.e.)
4 key The key as a string. key_length bytes
4+key_length value_length The number of bytes to store the value including the null terminator (if the value is non-binary). 32-bit int (b.e.)
8+key_length value The value as a byte sequence. value_length bytes
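
The layout of a single string-valued item can be sketched as a packing routine. The names write_be32 and pack_meta_string are hypothetical, and the caller is assumed to provide an output buffer of at least 8 + key_length + value_length bytes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Writes a 32-bit value in big-endian byte order. */
static void write_be32(unsigned char *p, uint32_t v) {
    p[0] = (unsigned char)(v >> 24);
    p[1] = (unsigned char)(v >> 16);
    p[2] = (unsigned char)(v >> 8);
    p[3] = (unsigned char)v;
}

/* Packs one (key, value) string pair and returns the bytes written.
   Both lengths include the null terminator, as in the table. */
size_t pack_meta_string(unsigned char *out, const char *key,
                        const char *value) {
    uint32_t klen = (uint32_t)strlen(key) + 1;
    uint32_t vlen = (uint32_t)strlen(value) + 1;
    write_be32(out, klen);                 /* offset 0: key_length */
    memcpy(out + 4, key, klen);            /* offset 4: key */
    write_be32(out + 4 + klen, vlen);      /* offset 4+key_length */
    memcpy(out + 8 + klen, value, vlen);   /* offset 8+key_length */
    return 8 + (size_t)klen + vlen;
}
```

Packing the key "k" with the value "ab" produces 13 bytes: a 4-byte length of 2, the key and its terminator, a 4-byte length of 3, and the value and its terminator.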

SIF Library and File Format Versions

Significant time was invested in designing the SIF file format. Yet, it is inevitable users will ask the author to make changes to it. This is a tricky road for several reasons. Changes that are only useful to a few users pose a problem: the rest of the user base may have compatibility issues when sharing their files, since some users will choose to update their library while others will stick with older versions. Changes also add complexity to the unpacking logic in the I/O library, especially since an effort is made to ensure backwards compatibility of new versions of the library with older versions of the format. Therefore, my philosophy on changing and developing SIF is one that encourages improvements to the API over changes to the format.

The first release of the SIF I/O library (0.9) and SIF File Format (code 1) was internal while the second release (1.0 and code 2) was the first public release. Version 1 assumes integers in the header, tile headers, and meta-data headers are big-endian and doubles are little-endian. Realizing this was confusing, version 2 assumes doubles in the headers (namely sif_header::affine_geo_transform) are also big-endian. Files can be written using older versions of the SIF File Format using the sif_use_file_format_version function.

The following table lists the file versions supported by each version of the SIF I/O library.

SIF Software Version  Read  Write
0.9                   1     1
1.0                   1-2   1-2

Image Pixel and Tile Header Index Computation

The number of pixels along the x-coordinate axis is given by the image width, and the number of pixels along the y-coordinate axis is given by the image height. Each tile is referenced by a tile coordinate, (tx, ty), where tx is the index of the tile with respect to the x-coordinate axis and ty is the index of the tile with respect to the y-coordinate axis. The x index is the fastest advancing index; the y index, the second fastest advancing; and the band index, the slowest advancing index. The absolute offset q, counted in data units, is computed from a pixel coordinate (x,y,b) as follows:

  q=(b * image_width * image_height) + (image_width * y + x)

The absolute tile index r is similarly computed,

  r=(n_tiles_across*ty)+tx
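
Both formulas translate directly into C. In this sketch the function names are hypothetical, all coordinates are assumed to be zero-based, and the pixel index counts data units, so a byte offset into a raster buffer would be the index multiplied by data_unit_size.

```c
/* Index q of pixel (x, y, b) in a raster where x advances fastest,
   then y, then the band index b. */
long pixel_index(long x, long y, long b, long image_width,
                 long image_height) {
    return (b * image_width * image_height) + (image_width * y + x);
}

/* Absolute index r of the tile at tile coordinate (tx, ty). */
long tile_index(long tx, long ty, long n_tiles_across) {
    return (n_tiles_across * ty) + tx;
}
```

For a 4 x 3 image, pixel (2, 1) in band 1 has index 12 + 4 + 2 = 18; with four tiles across, tile (1, 2) has index 9.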

Error Checking and Reporting

The SIF library performs extensive error checking for I/O errors, memory allocation errors, and errors in the parameters passed to SIF functions. The sif_file::error code is set to 0 or SIF_ERROR_NONE during normal operation. No sif-io.h function resets this flag, so the caller must do so if the error is deemed non-fatal and the caller wishes to perform further operations on the file. Most SIF functions return immediately the first time an error is encountered. Memory allocated during an operation resulting in an error is deallocated prior to returning. The value returned by non-void functions during an error depends on the expected range of values for that function. If a pointer is usually returned, 0 is returned; if a positive number is usually returned, a non-positive number is returned; and if a non-positive number is usually returned, a number greater than zero is returned. Because return values on error are not consistent across functions, callers should test the sif_file::error flag rather than checking the return value. The sif_get_error_description function returns a string description of an error code, which callers may conveniently use when reporting errors.

Testing for a valid SIF file

The sif_is_possibly_sif_file function checks whether a file could possibly be a SIF file. The present version only checks whether the magic number is valid. Future versions of the library will ensure:

Simple Convention Interface

Bundled with the SIF library are functions for manipulating SIF files conforming to the "simple" data type convention. These functions begin with sif_simple_.

Creating SIF Simple Files

The sif_simple_create and sif_simple_create_defaults functions are both used to create a SIF file conforming to the "simple" data type convention. The latter function sets defaults related to consolidation, uniformity checking, defragmentation, and tile size. Native byte order is used to store the image rasters; however, header fields, tile header fields, and meta-data length fields are all stored in big-endian byte order, regardless of the endianness of the image rasters. If the file's image endianness is to be changed, the sif_simple_set_endian function must be called after creating the file and prior to performing any image I/O. Undefined behavior occurs when the endian field is changed after performing image I/O.

Image I/O

When data blocks are written to or read from a file, the blocks are converted to the appropriate byte order prior to writing to the file or after reading from it. sif_simple_ functions may not be used unless the file is opened with the sif_simple_create, sif_simple_create_defaults, or sif_simple_open function.

Rectangular Region I/O

The sif_simple_set_raster and sif_simple_get_raster functions are used to write and read a rectangular region, respectively. Only one band can be read or written at a time. The offsets and dimensions of the region are in pixel units, not tile units. The sif_simple_is_shallow_uniform checks whether the tiles comprising a rectangular region are stored as shallow uniform.

Tile Block I/O

The sif_simple_get_tile_slice and sif_simple_set_tile_slice functions read and write a slice. The sif_simple_fill_tile_slice function fills a slice with a constant value. The sif_is_slice_shallow_uniform function checks whether a slice is stored as shallow uniform in the file.

Checking for Conformity

The conformity of a file to the "simple" data type convention can be verified with the sif_is_simple_file or sif_is_simple_file_by_name functions. The first function assumes the file has already been opened with sif_open while the second accepts a filename.

Notes on Memory Preallocation

SIF allocates enough memory to hold two image blocks in memory for each open SIF file. When the sif_simple_open (for update), sif_simple_create, or sif_simple_create_defaults functions are used to open or create a SIF file conforming to the "simple" data type convention, a buffer is also allocated for converting the byte order of image rasters. The buffer is initially the size of a block. When a call is made to sif_simple_set_raster with a raster larger than the size of the buffer, the buffer is enlarged appropriately. Note that this buffer is not needed if the file is opened for read-only access, since the byte order conversion is performed on the caller's buffer. All of a file's memory buffers are deallocated during close.

Command Line Utilities

We provide the sif-util command for you to use to create, inspect and manipulate SIF files at a UNIX or DOS shell. The first argument is the name of the file to manipulate; the second argument, the name of the operation to perform; and the remaining arguments, the parameters of the operation.

  sif-util filename operation operation-args

sif-util Supported Operations

We now describe each of the operations supported by sif-util. Arguments are mandatory unless enclosed with square brackets. Indicated in parentheses is whether the file must be writable to perform the operation.

A Note on PNM Output

When the PNM output operations region-to-pnm and tile-to-pnm are used, the pixel values are assumed to be of unsigned type. If the number of bands to write is 3, the PPM subformat is used with each of the three bands representing a separate color (band[0]=R, band[1]=G, band[2]=B). When writing a single band, the PGM format is used. If the image contains any other number of bands, the PAM format is used. The data unit size must not exceed 2 bytes, i.e. only uint8 and uint16 are supported. It is assumed the image raster is stored in native byte order if the file does not conform to the "simple" data type convention. The image raster is translated into proper ASCII decimal form (PPM or PGM format) or big-endian byte order (PAM format) prior to being outputted.
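
The format selection and byte-order rules above can be sketched as follows. This is not sif-util's code: pnm_subformat and u16_to_be are hypothetical names used only to illustrate the rules.

```c
#include <stdint.h>

/* Chooses the PNM subformat from the number of bands to write. */
const char *pnm_subformat(int bands) {
    if (bands == 1) {
        return "PGM";   /* single band: grayscale */
    }
    if (bands == 3) {
        return "PPM";   /* band[0]=R, band[1]=G, band[2]=B */
    }
    return "PAM";       /* any other number of bands */
}

/* Stores a 16-bit value in big-endian byte order, as needed for PAM. */
void u16_to_be(uint16_t v, unsigned char out[2]) {
    out[0] = (unsigned char)(v >> 8);
    out[1] = (unsigned char)(v & 0xFF);
}
```

A caller would still need to reject data unit sizes above 2 bytes before using either helper, since only uint8 and uint16 are supported.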


Generated on Tue Dec 4 11:02:10 2007 for SIF by  doxygen 1.4.7