Module re_arrow2::io::ipc

source ·
Expand description

APIs to read from and write to Arrow’s IPC format.

Inter-process communication is a method through which different processes share and pass data between them. Its use-cases include parallel processing of chunks of data across different CPU cores, transferring data between different Apache Arrow implementations in other languages and more. Under the hood Apache Arrow uses FlatBuffers as its binary protocol, so every Arrow-centered streaming or serialiation problem that could be solved using FlatBuffers could probably be solved using the more integrated approach that is exposed in this module.

Arrow’s IPC protocol allows only batch or dictionary columns to be passed around due to its reliance on a pre-defined data scheme. This constraint provides a large performance gain because serialized data will always have a known structutre, i.e. the same fields and datatypes, with the only variance being the number of rows and the actual data inside the Batch. This dramatically increases the deserialization rate, as the bytes in the file or stream are already structured “correctly”.

Reading and writing IPC messages is done using one of two variants - either FileReader <-> FileWriter or StreamReader <-> StreamWriter. These two variants wrap a type T that implements Read, and in the case of the File variant it also implements Seek. In practice it means that Files can be arbitrarily accessed while Streams are only read in certain order - the one they were written in (first in, first out).

§Examples

Read and write to a file:

use re_arrow2::io::ipc::{{read::{FileReader, read_file_metadata}}, {write::{FileWriter, WriteOptions}}};
// Setup the writer
let path = "example.arrow".to_string();
let mut file = File::create(&path)?;
let x_coord = Field::new("x", DataType::Int32, false);
let y_coord = Field::new("y", DataType::Int32, false);
let schema = Schema::from(vec![x_coord, y_coord]);
let options = WriteOptions {compression: None};
let mut writer = FileWriter::try_new(file, schema, None, options)?;

// Setup the data
let x_data = Int32Array::from_slice([-1i32, 1]);
let y_data = Int32Array::from_slice([1i32, -1]);
let chunk = Chunk::try_new(vec![x_data.boxed(), y_data.boxed()])?;

// Write the messages and finalize the stream
for _ in 0..5 {
    writer.write(&chunk, None);
}
writer.finish();

// Fetch some of the data and get the reader back
let mut reader = File::open(&path)?;
let metadata = read_file_metadata(&mut reader)?;
let mut reader = FileReader::new(reader, metadata, None, None);
let row1 = reader.next().unwrap();  // [[-1, 1], [1, -1]]
let row2 = reader.next().unwrap();  // [[-1, 1], [1, -1]]
let mut reader = reader.into_inner();
// Do more stuff with the reader, like seeking ahead.

For further information and examples please consult the user guide. For even more examples check the examples folder in the main repository (1, 2, 3).

Modules§

  • A struct adapter of Read+Seek+Write to append to IPC files
  • APIs to read Arrow’s IPC format.
  • APIs to write to Arrow’s IPC format.

Structs§

  • Struct containing dictionary_id and nested IpcField, allowing users to specify the dictionary ids of the IPC fields when writing to IPC.
  • Struct containing fields and whether the file is written in little or big endian.