Iterate over a batch of events in OfflinePump
It's always a question of file-by-file or event-by-event processing. Sometimes many things can be done in a one-shot calculation over the whole file (in this case, a simple pump which iterates over files is preferable), and other times it's too complicated and one needs event-by-event processing (OfflinePump).
That's perfectly fine, but there are also cases where the interface is crucial: one-shot calculations which also need to work for a single event (when processing online streams).
While discussing online and offline processing of PID with @adomi, I came up with the idea of supporting iteration over slices. This would basically "force" the user to write an interface which is designed to work with a batch of events; online processing is then just the special case of a batch containing a single event.
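To illustrate what such a batch-oriented interface could look like, here is a minimal sketch. The function name classify_batch, the energy values and the threshold are all hypothetical and only serve as an example; nothing here is part of km3pipe.

```python
from typing import List, Sequence


def classify_batch(energies: Sequence[float], threshold: float = 10.0) -> List[bool]:
    """Hypothetical batch calculation: flag every event above the threshold.

    The function always receives a batch (a sequence of per-event values)
    and returns one result per event, so it never needs to know whether
    it is used offline or online.
    """
    return [energy > threshold for energy in energies]


# Offline: a batch of several events (made-up values).
offline_result = classify_batch([3.0, 12.5, 42.0])

# Online: the very same interface, with a batch containing a single event.
online_result = classify_batch([12.5])
```

The point is that the online case needs no separate code path: it is the same call with a batch of length one.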
The idea is to implement an option like batch_size= in OfflinePump so that it returns slices of at most batch_size events in each iteration.
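The slicing itself could be sketched roughly as follows. This is not the OfflinePump implementation, just an illustration of the intended behaviour on a plain Python sequence; the helper name iter_batches is an assumption made for this example.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(events: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `events` with at most `batch_size` items.

    Hypothetical helper: the last slice may be shorter than `batch_size`
    when the number of events is not a multiple of it.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(events), batch_size):
        yield events[start:start + batch_size]


# Seven dummy "events" split into batches of at most three.
events = list(range(7))
batches = list(iter_batches(events, batch_size=3))
```

With batch_size=1 every slice holds exactly one event, which corresponds to the online case.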
One possible problem, though, is that in this case group_id will refer to the batches and not to a single file or single event. This is not a problem as long as people are not misusing group_id, since that is just an implementation detail used to stitch together the same blob in each iteration.