revive make_data_split
Add make_data_split back into the main repo.
It produces a txt that can be used with the current concatenate
.
Done:
- Update make_data_split.py and move to orcasong/tools/
- adjust and create an example config (same dir)
- update docu accordingly
to do:
- add tests - though I dont know how to tbh...
Merge request reports
Activity
added 1 commit
- 5f3b11e0 - united layout with other tools from doctopts to argparse, no more natsort,...
Thanks! As a quick note, it would be good to not have too many changes in one request, as this makes it more difficult to review. E.g. i see that you added
There is also a newer, faster version 2: .. code-block:: python from orcasong.tools.shuffle2 import shuffle_v2 shuffle_v2(output_filepath_concat)
to the docu, but this has nothing to do with the data split thing, and is actually also not how shuffle2 is run (should only be run through parser atm). So please only change the things that regard what you add in the merge request. I can sort this out before merging.
Should we really delete the orcasong_contrib stuff? This was more intended as an archive of michael scripts to look stuff up.
And finally, it would be really good to have some sort of tests, its very difficult to maintain otherwise. So if I get this right, this script looks up filepaths first, and then arranges them into a txt list. So maybe you can just test it with a list of strings like so ["file1.h5", "file2.h5"] etc and make sure the output is as intended.
Thanks for your feedback. I will revert the changes about the shuffle2, I just thought I mention it there as it seemed to be missing.
What was in the contrib stuff is basically this: make_data_split -> now integrated concatenate -> now a new version by you shuffle -> now a new version by you more example configs -> for special cases and selections, hardly applicable as is for anyone. Whenever someone creates his sets he has to think about how to divide them himself. (one) detx -> well, they should be taken from the production directly
So, the only thing could be retained are the old concatenate + shuffle.
tests: Yes, I know^^ I will have another look at the tests and try to figure that out.
added 1 commit
- 70c8f563 - added option to produce submittable bash scripts to do the concat + shuffle
I added back in the option to directly create bash scripts to do the concatenation and shuffle. In these, there will be basically only the command to execute the concatenation/shuffle. And that for every file(train files, val file). Also added a test for that.
Edited by Daniel GuderianAh, the shuffle python programs cant be started as they are from the command line. the part with "main" is missing. The bash scripts that are created already need to know that command, i.e. "h5shuffle file". So, once the shuffle 2 is sorted out, in https://git.km3net.de/ml/OrcaSong/-/blob/concatenate_scripts/orcasong/tools/make_data_split.py#L343 the command has to be edited. The one at the moment, "postproc" is wrong^^