TPIE

2362a60
Serialization streams

Motivation

If you want to read and write text strings with TPIE file_streams, the interface requires a fixed string size. In some cases this is unreasonable: space is wasted on strings shorter than the size limit, and it may be impossible to give a fixed upper bound on the length of the strings a program has to operate on.

To address this, TPIE provides a serialization framework with a distinct set of stream readers and writers that support variable-length item types such as strings and arrays. Because the library also supports reversing and sorting such serialization streams, it becomes reasonably easy to implement external memory algorithms that operate on variable-length items.
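
The space argument above can be made concrete with a small calculation. The following is a minimal sketch that does not use TPIE at all: fixed_slot_bytes and length_prefixed_bytes are hypothetical helper names, and the 8-byte length prefix is an assumption chosen for illustration, not a statement about TPIE's on-disk format.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Bytes needed when every string occupies a slot as large as the longest string.
std::size_t fixed_slot_bytes(const std::vector<std::string> & items) {
    std::size_t slot = 0;
    for (std::size_t i = 0; i < items.size(); ++i) {
        slot = std::max(slot, items[i].size());
    }
    return items.size() * slot;
}

// Bytes needed when each string is stored as a length prefix followed by its characters.
// The 8-byte prefix is an illustrative assumption.
std::size_t length_prefixed_bytes(const std::vector<std::string> & items) {
    std::size_t total = 0;
    for (std::size_t i = 0; i < items.size(); ++i) {
        total += sizeof(std::uint64_t) + items[i].size();
    }
    return total;
}
```

For two short strings and one 100-character string, the fixed layout needs three 100-byte slots, while the length-prefixed layout needs only the characters plus three prefixes.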

The goal of TPIE serialization is neither to be portable across machines nor to provide type-checking on the serialized input. We do not track endianness or integer widths, so reading serialized streams written on a different platform is not supported in general. Indeed, the motivation for TPIE serialization is to support temporary streams of variable-width items in external memory; it is not intended as a persistent store or as a data transfer format.

TPIE serialization has built-in support for plain old data, also known as POD types. This built-in POD support excludes pointer types, however. POD types are serialized and unserialized by their in-memory representation. This is intended to be fast, not safe or portable.
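
To illustrate what serializing by in-memory representation means, here is a minimal sketch that copies the raw bytes of a value into a buffer and back. It is not TPIE code: sample_item, put_raw and get_raw are hypothetical names, and the static assertions merely mirror the restriction described above (trivially copyable data, no pointers).

```cpp
#include <cstddef>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical POD item used only for this illustration.
struct sample_item {
    int id;
    double weight;
};

// Append the in-memory bytes of a trivially copyable value to the buffer.
template <typename T>
void put_raw(std::vector<char> & buf, const T & v) {
    static_assert(std::is_trivially_copyable<T>::value, "raw copy needs a trivially copyable type");
    static_assert(!std::is_pointer<T>::value, "a pointer value is meaningless once written out");
    std::size_t old_size = buf.size();
    buf.resize(old_size + sizeof(T));
    std::memcpy(buf.data() + old_size, &v, sizeof(T));
}

// Copy sizeof(T) bytes starting at pos back into v and advance pos.
// No bounds checking is done; the caller must know what was written.
template <typename T>
void get_raw(const std::vector<char> & buf, std::size_t & pos, T & v) {
    static_assert(std::is_trivially_copyable<T>::value, "raw copy needs a trivially copyable type");
    std::memcpy(&v, buf.data() + pos, sizeof(T));
    pos += sizeof(T);
}
```

The bytes written this way include any padding and follow the host's endianness and type widths, which is why such a stream is fast to produce but only meaningful on the platform that wrote it.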

The framework also supports certain library types out of the box, such as std::vector, std::string and plain old arrays of serializable data.

Usage

The interface is straightforward to use. See the included test program "lines", the bulk of which is reproduced below.

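#include <iostream>
#include <string>
#include <tpie/serialization_stream.h>
#include <tpie/serialization_sorter.h>
#include <tpie/tempname.h>

void write_lines(std::istream & is, std::string filename) {
    std::string line;
    tpie::serialization_writer wr;
    wr.open(filename);
    // Store each input line as one variable-length item.
    while (std::getline(is, line)) {
        wr.serialize(line);
    }
    wr.close();
}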
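void reverse_lines(std::string filename) {
    tpie::temp_file f;
    {
        // Pass 1: copy the file into a temporary reverse stream.
        tpie::serialization_reader rd;
        rd.open(filename);
        tpie::serialization_reverse_writer wr;
        wr.open(f);
        while (rd.can_read()) {
            std::string line;
            rd.unserialize(line);
            wr.serialize(line);
        }
        wr.close();
        rd.close();
    }
    {
        // Pass 2: read the temporary stream in reverse order
        // and overwrite the original file.
        tpie::serialization_reverse_reader rd;
        rd.open(f);
        tpie::serialization_writer wr;
        wr.open(filename);
        while (rd.can_read()) {
            std::string line;
            rd.unserialize(line);
            wr.serialize(line);
        }
        wr.close();
        rd.close();
    }
}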
void read_lines(std::ostream & os, std::string filename) {
    tpie::serialization_reader rd;
    rd.open(filename);
    // Print every stored item on its own line.
    while (rd.can_read()) {
        std::string line;
        rd.unserialize(line);
        os << line << '\n';
    }
    rd.close();
}
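void sort_lines(std::string filename) {
    tpie::serialization_sorter<std::string> sorter;
    sorter.set_available_memory(50*1024*1024); // 50 MiB
    sorter.begin();
    {
        // Feed every item of the file to the sorter.
        tpie::serialization_reader rd;
        rd.open(filename);
        while (rd.can_read()) {
            std::string line;
            rd.unserialize(line);
            sorter.push(line);
        }
        rd.close();
    }
    sorter.end();
    sorter.merge_runs();
    {
        // Overwrite the file with the items in sorted order.
        tpie::serialization_writer wr;
        wr.open(filename);
        while (sorter.can_pull()) {
            wr.serialize(sorter.pull());
        }
        wr.close();
    }
}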
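
The call sequence in sort_lines (push items, end, merge runs, pull) matches the usual shape of an external merge sort: form sorted runs that fit in memory, then merge them. The following self-contained sketch shows that idea with in-memory runs. It is an illustration only, not TPIE's implementation: toy_run_sorter and all of its members are hypothetical, and the run capacity stands in for a memory limit.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an external sorter; a real one would keep runs on disk.
class toy_run_sorter {
public:
    explicit toy_run_sorter(std::size_t run_capacity) : m_capacity(run_capacity) {}

    // Buffer an item; once the buffer is full, sort it and store it as a run.
    void push(const std::string & item) {
        m_buffer.push_back(item);
        if (m_buffer.size() >= m_capacity) flush_run();
    }

    // Finish run formation and seed the merge heap with the head of each run.
    void end() {
        if (!m_buffer.empty()) flush_run();
        m_pos.assign(m_runs.size(), 1);
        for (std::size_t i = 0; i < m_runs.size(); ++i) {
            m_heap.push(entry(m_runs[i][0], i));
        }
    }

    bool can_pull() const { return !m_heap.empty(); }

    // Return the smallest remaining item across all runs.
    std::string pull() {
        entry top = m_heap.top();
        m_heap.pop();
        std::size_t r = top.second;
        if (m_pos[r] < m_runs[r].size()) {
            m_heap.push(entry(m_runs[r][m_pos[r]], r));
            ++m_pos[r];
        }
        return top.first;
    }

    std::size_t run_count() const { return m_runs.size(); }

private:
    typedef std::pair<std::string, std::size_t> entry; // (item, index of its run)

    void flush_run() {
        std::sort(m_buffer.begin(), m_buffer.end());
        m_runs.push_back(m_buffer);
        m_buffer.clear();
    }

    std::size_t m_capacity;
    std::vector<std::string> m_buffer;
    std::vector<std::vector<std::string> > m_runs;
    std::vector<std::size_t> m_pos; // next unread index in each run
    std::priority_queue<entry, std::vector<entry>, std::greater<entry> > m_heap;
};
```

With a run capacity of two, pushing five strings yields three runs, and repeated calls to pull return the strings in sorted order regardless of which run they came from.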

User-supplied serializable types

For types other than those supported natively by TPIE serialization, the user can provide overloads of the free functions serialize and unserialize. For example, we can implement simple serialization/unserialization of a point type:

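namespace userland {

struct point2 {
    double x;
    double y;
};

template <typename Dst>
void serialize(Dst & d, const point2 & pt) {
    // Make TPIE's built-in overloads visible for the double members.
    using tpie::serialize;
    serialize(d, pt.x);
    serialize(d, pt.y);
}

template <typename Src>
void unserialize(Src & s, point2 & pt) {
    using tpie::unserialize;
    // Read the members in the same order they were written.
    unserialize(s, pt.x);
    unserialize(s, pt.y);
}

} // namespace userland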
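
To see how such overloads compose without linking against TPIE, the sketch below repeats the point2 overloads against a hypothetical in-memory buffer. The toy namespace, byte_buffer, and its double overloads are assumptions made up for this illustration; they only play the role that TPIE's streams and built-in overloads play in the listing above.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

namespace toy {

// Hypothetical in-memory stand-in for a serialization stream.
struct byte_buffer {
    std::vector<char> bytes;
    std::size_t read_pos;
    byte_buffer() : read_pos(0) {}
};

// Built-in style overload: copy the in-memory representation of a double.
inline void serialize(byte_buffer & d, double v) {
    std::size_t old_size = d.bytes.size();
    d.bytes.resize(old_size + sizeof v);
    std::memcpy(d.bytes.data() + old_size, &v, sizeof v);
}

// Read a double back; no bounds checking is done in this sketch.
inline void unserialize(byte_buffer & s, double & v) {
    std::memcpy(&v, s.bytes.data() + s.read_pos, sizeof v);
    s.read_pos += sizeof v;
}

} // namespace toy

namespace userland {

struct point2 {
    double x;
    double y;
};

// The unqualified calls find toy::serialize through argument-dependent
// lookup on the destination type.
template <typename Dst>
void serialize(Dst & d, const point2 & pt) {
    serialize(d, pt.x);
    serialize(d, pt.y);
}

template <typename Src>
void unserialize(Src & s, point2 & pt) {
    unserialize(s, pt.x);
    unserialize(s, pt.y);
}

} // namespace userland
```

Writing a point2 to a toy::byte_buffer produces two doubles back to back, and reading it returns the same coordinates. The design point is that the user-supplied overload never touches bytes itself; it only delegates to overloads for its members.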

For a more complicated example, consider how we might serialize and unserialize a std::vector.

template <typename D, typename T, typename alloc_t>
void serialize(D & dst, const std::vector<T, alloc_t> & v) {
    using tpie::serialize;
    // Write the element count first, then the elements.
    serialize(dst, v.size());
    serialize(dst, v.begin(), v.end());
}

template <typename S, typename T, typename alloc_t>
void unserialize(S & src, std::vector<T, alloc_t> & v) {
    using tpie::unserialize;
    // Read the element count, size the vector, then fill it.
    typename std::vector<T, alloc_t>::size_type s;
    unserialize(src, s);
    v.resize(s);
    unserialize(src, v.begin(), v.end());
}
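
The length-prefix scheme used for std::vector can be tried end to end with a small self-contained sketch. As before, this is not TPIE code: the toy namespace and byte_buffer are hypothetical, and the element-by-element loop replaces the iterator-range overloads used in the listing above.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <type_traits>
#include <vector>

namespace toy {

// Hypothetical in-memory stand-in for a serialization stream.
struct byte_buffer {
    std::vector<char> bytes;
    std::size_t read_pos;
    byte_buffer() : read_pos(0) {}
};

// Arithmetic types: copy the in-memory representation.
template <typename T>
typename std::enable_if<std::is_arithmetic<T>::value>::type
serialize(byte_buffer & d, const T & v) {
    std::size_t old_size = d.bytes.size();
    d.bytes.resize(old_size + sizeof(T));
    std::memcpy(d.bytes.data() + old_size, &v, sizeof(T));
}

template <typename T>
typename std::enable_if<std::is_arithmetic<T>::value>::type
unserialize(byte_buffer & s, T & v) {
    std::memcpy(&v, s.bytes.data() + s.read_pos, sizeof(T));
    s.read_pos += sizeof(T);
}

// Strings: length prefix followed by the characters.
inline void serialize(byte_buffer & d, const std::string & str) {
    serialize(d, str.size());
    d.bytes.insert(d.bytes.end(), str.begin(), str.end());
}

inline void unserialize(byte_buffer & s, std::string & str) {
    std::string::size_type n;
    unserialize(s, n);
    str.assign(s.bytes.data() + s.read_pos, n);
    s.read_pos += n;
}

// Vectors: element count followed by each element in order.
template <typename T, typename A>
void serialize(byte_buffer & d, const std::vector<T, A> & v) {
    serialize(d, v.size());
    for (typename std::vector<T, A>::const_iterator it = v.begin(); it != v.end(); ++it) {
        serialize(d, *it);
    }
}

template <typename T, typename A>
void unserialize(byte_buffer & s, std::vector<T, A> & v) {
    typename std::vector<T, A>::size_type n;
    unserialize(s, n);
    v.resize(n);
    for (typename std::vector<T, A>::iterator it = v.begin(); it != v.end(); ++it) {
        unserialize(s, *it);
    }
}

} // namespace toy
```

Because the count is written before the elements, the reader knows how far to read without any terminator, and the same pattern nests: a vector of strings is a count followed by length-prefixed strings. The sketch does no bounds checking, so it assumes the buffer was produced by the matching serialize calls.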