I have some very large (>4 GB) files containing (millions of) fixed-length binary records. I want to (efficiently) join them to records in other files by writing pointers (i.e. 64-bit record numbers) into those records at specific offsets.

To elaborate, I have a pair of lists of (key, record number) tuples, sorted by key, for each join I want to perform on a given pair of files, say, A and B. Iterating through a list pair and matching up the keys yields a list of (key, record number A, record number B) tuples representing the joined records (assuming a 1:1 mapping for simplicity). To complete the join, I conceptually need to seek to each A record in the list and write the corresponding B record number at the appropriate offset, and vice versa.

My question is: what is the fastest way to actually do this? Since the list of joined records is sorted by key, the associated record numbers are essentially random. Assuming the file is much larger than the OS disk cache, doing a bunch of random seeks and writes seems extremely inefficient.

I've tried partially sorting the record numbers by putting the A->B and B->A mappings in a sparse array and flushing the densest clusters of entries to disk whenever I run out of memory. This greatly increases the chances that the appropriate records will still be cached for a cluster after updating its first pointer. However, even at this point, is it generally better to do a bunch of seeks and blind writes, or to read chunks of the file manually, update the appropriate pointers, and write the chunks back out? While the former method is much simpler and could be optimized by the OS to do the bare minimum of sector reads (since it knows the sector size) and copies (it can avoid copies by reading directly into properly aligned buffers), it seems that it would incur extremely high syscall overhead.