Should we change Dataset to adaptively create the correct subclass depending on the input path?

Created by: AMHermansen

At the meeting today I mentioned that we could use hooks from __new__ and __init_subclass__ to automatically infer the desired dataset-subclass.

Currently two Datasets are in use, SQLiteDataset and ParquetDataset, and both take the same input arguments.

I think we could override __new__ and __init_subclass__ in Dataset so that __init_subclass__ builds a "subclass registry" that maps file extensions to the implemented datasets, and __new__ looks up that registry to find the correct subclass to instantiate.

A "simple" illustration of what this would look like:

from typing import Iterable, Union


class A:
    _subclass_registry = {}
    def __init__(self, arg1, arg2, kwarg1=None, kwarg2=None, *, path: str):
        self.path = path
        self.arg1 = arg1
        self.arg2 = arg2
        self.kwarg1 = kwarg1
        self.kwarg2 = kwarg2

    def __init_subclass__(cls, file_extensions: Union[str, Iterable[str]], **kwargs):
        if isinstance(file_extensions, str):
            file_extensions = [file_extensions]
        for ext in file_extensions:
            if ext in A._subclass_registry:
                raise ValueError(f"Duplicate file extension: {ext}")
            A._subclass_registry[ext] = cls
        super().__init_subclass__(**kwargs)

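    def __new__(cls, *args, **kwargs):
        # Dispatch on the file extension only when A itself is called.
        # Calling a subclass directly (e.g. B(...)) creates that subclass;
        # otherwise B(..., path="file.ext2") would return a C instance whose
        # __init__ is never run.
        if cls is not A:
            return object.__new__(cls)
        path = kwargs["path"]
        file_extension = path.split(".")[-1]
        subclass = cls._subclass_registry.get(file_extension, None)
        if subclass is None:
            raise ValueError(f"Unknown file extension: {file_extension}")
        return object.__new__(subclass)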


class B(A, file_extensions="ext1"):
    def __init__(self, arg1, arg2, kwarg1=None, kwarg2=None, *, path: str):
        super().__init__(arg1, arg2, kwarg1=kwarg1, kwarg2=kwarg2, path=path)
        print(f"Created B instance with path: {self.path}")


class C(A, file_extensions=["ext2", "ext3"]):
    def __init__(self, arg1, arg2, kwarg1=None, kwarg2=None, *, path: str):
        super().__init__(arg1, arg2, kwarg1=kwarg1, kwarg2=kwarg2, path=path)
        print(f"Created C instance with path: {self.path}")


if __name__ == "__main__":
    a = A(1, 2, path="file.ext1")  # Creates object from class B
    b = A(3, 4, path="file.ext2")  # Creates object from class C
    c = A(5, 6, path="file.ext3")  # Creates object from class C
    print(f"{type(a)=}")
    print(f"{type(b)=}")
    print(f"{type(c)=}")

Pros:

I think it would be easier for end-users to interact with the library, since they only need to create a Dataset, and don't need to worry about finding the correct subclass for their data back-end.

Cons:

We restrict ourselves to having all datasets take the same arguments and keyword arguments.
We restrict ourselves to one dataset per file extension. (This can be circumvented to some extent, but not as elegantly.)
It might be slightly harder to debug a dataset object, because it is not immediately obvious which subclass it is.
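
One possible way around the "one dataset per file extension" restriction, as a rough sketch only: let the registry hold several candidates per extension and add an optional keyword to choose between them. The names Base, Fast, Safe, Only and the backend keyword are hypothetical and exist only for this illustration; nothing here is existing library API.

```python
from typing import Dict, Iterable, Optional, Type, Union


class Base:
    # Hypothetical registry layout: extension -> {backend name -> subclass}.
    _registry: Dict[str, Dict[str, Type["Base"]]] = {}

    def __init_subclass__(
        cls,
        file_extensions: Union[str, Iterable[str]] = (),
        backend: Optional[str] = None,
        **kwargs,
    ):
        super().__init_subclass__(**kwargs)
        if isinstance(file_extensions, str):
            file_extensions = [file_extensions]
        name = backend or cls.__name__  # fall back to the class name
        for ext in file_extensions:
            candidates = Base._registry.setdefault(ext, {})
            if name in candidates:
                raise ValueError(f"Duplicate backend {name!r} for extension {ext!r}")
            candidates[name] = cls

    def __new__(cls, *args, path: str, backend: Optional[str] = None, **kwargs):
        if cls is not Base:
            # A subclass was requested explicitly, so no dispatch is needed.
            return super().__new__(cls)
        ext = path.split(".")[-1]
        candidates = Base._registry.get(ext)
        if not candidates:
            raise ValueError(f"Unknown file extension: {ext!r}")
        if backend is None:
            if len(candidates) > 1:
                raise ValueError(
                    f"Extension {ext!r} is ambiguous, pass backend= one of {sorted(candidates)}"
                )
            backend = next(iter(candidates))
        if backend not in candidates:
            raise ValueError(f"Unknown backend {backend!r} for extension {ext!r}")
        return super().__new__(candidates[backend])

    def __init__(self, *args, path: str, backend: Optional[str] = None, **kwargs):
        # __init__ receives the same arguments as __new__, so it must accept
        # the backend keyword as well.
        self.path = path


class Fast(Base, file_extensions="ext1", backend="fast"):
    pass


class Safe(Base, file_extensions="ext1", backend="safe"):
    pass


class Only(Base, file_extensions="ext2"):
    pass


if __name__ == "__main__":
    print(type(Base(path="file.ext1", backend="safe")).__name__)
    print(type(Base(path="file.ext2")).__name__)
```

The cost is the one already noted: the extra keyword leaks into every signature, and an ambiguous extension has to raise an error (or fall back to some default), which is less elegant than the one-to-one mapping.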