Skip to content
Snippets Groups Projects

revive make_data_split

Merged Daniel Guderian requested to merge concatenate_scripts into master

Add make_data_split back into the main repo.

It produces a txt that can be used with the current concatenate.

Done:

  • Update make_data_split.py and move to orcasong/tools/
  • adjust and create an example config (same dir)
  • update docu accordingly

to do:

  • add tests - though I dont know how to tbh...
Edited by Daniel Guderian

Merge request reports

Loading
Loading

Activity

Filter activity
  • Approvals
  • Assignees & reviewers
  • Comments (from bots)
  • Comments (from users)
  • Commits & branches
  • Edits
  • Labels
  • Lock status
  • Mentions
  • Merge request status
  • Tracking
  • Daniel Guderian changed the description

    changed the description

  • added 1 commit

    Compare with previous version

  • Nice! Some points to discuss:

    • Maybe we should create a seperate directory for the example toml
    • are the changes to orcasong_contrib concatenate necessary?
    • can we get rid of natsort and docopt, id like to reduce dependencies to a minimum
  • Ah and can I also suggest to expand the doc a bit. Because from reading it I still dont know what this script actually does ^^

  • added 1 commit

    • 5f3b11e0 - united layout with other tools from doctopts to argparse, no more natsort,...

    Compare with previous version

  • added 1 commit

    Compare with previous version

  • added 1 commit

    Compare with previous version

  • K, I updated:

    • separate dir for configs
    • deleted old data tools in contrib
    • no more natsort + argparse instead of docopt
    • added a bit to the docu, but there is already everything that happens^^ the example config has explanations for the parameters and with that it should be doable
  • Thanks! As a quick note, it would be good to not have too many changes in one request, as this makes it more difficult to review. E.g. i see that you added

    There is also a newer, faster version 2:
    
    .. code-block:: python
    
        from orcasong.tools.shuffle2 import shuffle_v2
        
        shuffle_v2(output_filepath_concat)

    to the docu, but this has nothing to do with the data split thing, and is actually also not how shuffle2 is run (should only be run through parser atm). So please only change the things that regard what you add in the merge request. I can sort this out before merging.

    Should we really delete the orcasong_contrib stuff? This was more intended as an archive of michael scripts to look stuff up.

    And finally, it would be really good to have some sort of tests, its very difficult to maintain otherwise. So if I get this right, this script looks up filepaths first, and then arranges them into a txt list. So maybe you can just test it with a list of strings like so ["file1.h5", "file2.h5"] etc and make sure the output is as intended.

  • Thanks for your feedback. I will revert the changes about the shuffle2, I just thought I mention it there as it seemed to be missing.

    What was in the contrib stuff is basically this: make_data_split -> now integrated concatenate -> now a new version by you shuffle -> now a new version by you more example configs -> for special cases and selections, hardly applicable as is for anyone. Whenever someone creates his sets he has to think about how to divide them himself. (one) detx -> well, they should be taken from the production directly

    So, the only thing could be retained are the old concatenate + shuffle.

    tests: Yes, I know^^ I will have another look at the tests and try to figure that out.

  • Daniel Guderian added 2 commits

    added 2 commits

    • a4cfe2b0 - added test routine
    • 8d7cf574 - Merge branch 'concatenate_scripts' of git.km3net.de:ml/OrcaSong into concatenate_scripts

    Compare with previous version

  • added 1 commit

    Compare with previous version

  • Okay, I added the tests. Can you have a look, are they ok like that?

    This would be it from my side. Feel free to leave the old scripts in the contrib or not.

  • Daniel Guderian added 2 commits

    added 2 commits

    Compare with previous version

  • added 1 commit

    • 70c8f563 - added option to produce submittable bash scripts to do the concat + shuffle

    Compare with previous version

  • added 1 commit

    Compare with previous version

  • I added back in the option to directly create bash scripts to do the concatenation and shuffle. In these, there will be basically only the command to execute the concatenation/shuffle. And that for every file(train files, val file). Also added a test for that.

    Edited by Daniel Guderian
  • Looking good, are you done with it? Ill do some minor changes and merge it. I probably wont be able to release it just yet though due to the artifact file size limitation :sweat_smile:

  • In principle yes. I am currently doing a test run with a larger set size. Let me see if everything is as expected.

  • Ah, the shuffle python programs cant be started as they are from the command line. the part with "main" is missing. The bash scripts that are created already need to know that command, i.e. "h5shuffle file". So, once the shuffle 2 is sorted out, in https://git.km3net.de/ml/OrcaSong/-/blob/concatenate_scripts/orcasong/tools/make_data_split.py#L343 the command has to be edited. The one at the moment, "postproc" is wrong^^

  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
Please register or sign in to reply
Loading