Should we change Dataset to adaptively create the correct subclass depending on the input path?
Created by: AMHermansen
At the meeting today I mentioned that we could use hooks from __new__ and __init_subclass__ to automatically infer the desired dataset-subclass.
Currently we have two Datasets which are being used SQLiteDataset and ParquetDataset, and they both take the same inputs arguments.
I think we could overwrite __new__ and __init_subclass__ in Dataset in such a way that __init_subclass__ would make a "subclass registry", which connects file-extensions to implemented datasets, and then __new__ would look up the subclass registry and find the correct subclass to instantiate.
A "simple" illustration of how this would look like
from typing import Iterable, Union
class A:
_subclass_registry = {}
def __init__(self, arg1, arg2, kwarg1=None, kwarg2=None, *, path: str):
self.path = path
self.arg1 = arg1
self.arg2 = arg2
self.kwarg1 = kwarg1
self.kwarg2 = kwarg2
def __init_subclass__(cls, file_extensions: Union[str, Iterable[str]], **kwargs):
if isinstance(file_extensions, str):
file_extensions = [file_extensions]
for ext in file_extensions:
if ext in cls._subclass_registry:
raise ValueError(f"Duplicate file extension: {ext}")
A._subclass_registry[ext] = cls
super().__init_subclass__(**kwargs)
def __new__(cls, *args, **kwargs):
path = kwargs["path"]
file_extension = path.split(".")[-1]
subclass = cls._subclass_registry.get(file_extension, None)
if subclass is None:
raise ValueError(f"Unknown file extension: {file_extension}")
return object.__new__(subclass)
class B(A, file_extensions="ext1"):
def __init__(self, arg1, arg2, kwarg1=None, kwarg2=None, *, path: str):
super().__init__(arg1, arg2, kwarg1=kwarg1, kwarg2=kwarg2, path=path)
print(f"Created B instance with path: {self.path}")
class C(A, file_extensions=["ext2", "ext3"]):
def __init__(self, arg1, arg2, kwarg1=None, kwarg2=None, *, path: str):
super().__init__(arg1, arg2, kwarg1=kwarg1, kwarg2=kwarg2, path=path)
print(f"Created C instance with path: {self.path}")
if __name__ == "__main__":
a = A(1, 2, path="file.ext1") # Creates object from class B
b = A(3, 4, path="file.ext2") # Creates object from class C
c = A(5, 6, path="file.ext3") # Creates object from class C
print(f"{type(a)=}")
print(f"{type(b)=}")
print(f"{type(c)=}")
Pros:
- I think it would be easier for end-users to interact with the library, since they only need to create a Dataset, and don't need to worry about finding the correct subclass for their data back-end.
Cons:
- We restrict ourselves to have all datasets take the same arguments/keyword arguments
- We restrict ourselves to only have one dataset per file extension. (This can somewhat be circumvented, but it is not as elegant)
- It might be slightly more difficult to debug a dataset object, because it is not completely straightforward which subclass it is.