revive make_data_split

changed the description

added 1 commit

13cc9d1a - Update requirements.txt

Nice! Some points to discuss:

Maybe we should create a seperate directory for the example toml
are the changes to orcasong_contrib concatenate necessary?
can we get rid of natsort and docopt, id like to reduce dependencies to a minimum

Ah and can I also suggest to expand the doc a bit. Because from reading it I still dont know what this script actually does ^^

added 1 commit

5f3b11e0 - united layout with other tools from doctopts to argparse, no more natsort,...

Compare with previous version

added 1 commit

611acf0c - deleted old data tools

Compare with previous version

added 1 commit

96fd0003 - Update tools.rst

Compare with previous version

K, I updated:

separate dir for configs
deleted old data tools in contrib
no more natsort + argparse instead of docopt
added a bit to the docu, but there is already everything that happens^^ the example config has explanations for the parameters and with that it should be doable

Thanks! As a quick note, it would be good to not have too many changes in one request, as this makes it more difficult to review. E.g. i see that you added

There is also a newer, faster version 2:

.. code-block:: python

    from orcasong.tools.shuffle2 import shuffle_v2
    
    shuffle_v2(output_filepath_concat)

to the docu, but this has nothing to do with the data split thing, and is actually also not how shuffle2 is run (should only be run through parser atm). So please only change the things that regard what you add in the merge request. I can sort this out before merging.

Should we really delete the orcasong_contrib stuff? This was more intended as an archive of michael scripts to look stuff up.

And finally, it would be really good to have some sort of tests, its very difficult to maintain otherwise. So if I get this right, this script looks up filepaths first, and then arranges them into a txt list. So maybe you can just test it with a list of strings like so ["file1.h5", "file2.h5"] etc and make sure the output is as intended.

Thanks for your feedback. I will revert the changes about the shuffle2, I just thought I mention it there as it seemed to be missing.

What was in the contrib stuff is basically this: make_data_split -> now integrated concatenate -> now a new version by you shuffle -> now a new version by you more example configs -> for special cases and selections, hardly applicable as is for anyone. Whenever someone creates his sets he has to think about how to divide them himself. (one) detx -> well, they should be taken from the production directly

So, the only thing could be retained are the old concatenate + shuffle.

tests: Yes, I know^^ I will have another look at the tests and try to figure that out.

added 2 commits

a4cfe2b0 - added test routine
8d7cf574 - Merge branch 'concatenate_scripts' of git.km3net.de:ml/OrcaSong into concatenate_scripts

Compare with previous version

added 1 commit

808450f9 - remove shuffle2 part

Compare with previous version

Okay, I added the tests. Can you have a look, are they ok like that?

This would be it from my side. Feel free to leave the old scripts in the contrib or not.

added 2 commits

cb84b8a6 - fixted tests
f18878a1 - remove test output

Compare with previous version

added 1 commit

70c8f563 - added option to produce submittable bash scripts to do the concat + shuffle

Compare with previous version

added 1 commit

5032f1af - Update tools.rst

Compare with previous version

I added back in the option to directly create bash scripts to do the concatenation and shuffle. In these, there will be basically only the command to execute the concatenation/shuffle. And that for every file(train files, val file). Also added a test for that.

Looking good, are you done with it? Ill do some minor changes and merge it. I probably wont be able to release it just yet though due to the artifact file size limitation

In principle yes. I am currently doing a test run with a larger set size. Let me see if everything is as expected.

Ah, the shuffle python programs cant be started as they are from the command line. the part with "main" is missing. The bash scripts that are created already need to know that command, i.e. "h5shuffle file". So, once the shuffle 2 is sorted out, in https://git.km3net.de/ml/OrcaSong/-/blob/concatenate_scripts/orcasong/tools/make_data_split.py#L343 the command has to be edited. The one at the moment, "postproc" is wrong^^

revive make_data_split

Merge request reports

Activity