get-preprocessed-dataset-squad

Automatically generated README for this automation recipe: get-preprocessed-dataset-squad

Category: AI/ML datasets

License: Apache 2.0

  • CM meta description for this script: _cm.yaml
  • Output cached? True

Reuse this script in your project

Install MLCommons CM automation meta-framework

Pull CM repository with this automation recipe (CM script)

cm pull repo mlcommons@cm4mlops

cmr "get dataset preprocessed tokenized squad" --help

Run this script

Run this script via CLI
cm run script --tags=get,dataset,preprocessed,tokenized,squad[,variations] 
Run this script via CLI (alternative)
cmr "get dataset preprocessed tokenized squad [variations]" 
Run this script from Python
import cmind

r = cmind.access({'action':'run',
              'automation':'script',
              'tags':'get,dataset,preprocessed,tokenized,squad',
              'out':'con',
              # ...
              # (other input keys for this script)
              # ...
             })

if r['return']>0:
    print (r['error'])
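
Variations are passed as extra comma-separated entries in the tag list, as in the CLI form above. A minimal sketch of building that string before handing it to `cmind.access` follows; the helper `build_tags` is hypothetical (not part of CM), and the value `256` is only an illustrative substitute for the `#` wildcard:

```python
# Base tags for this script, taken from the CLI examples in this README.
BASE_TAGS = "get,dataset,preprocessed,tokenized,squad"

def build_tags(variations=()):
    """Append variation names (e.g. "_calib1") to the base tag string.

    Hypothetical helper for illustration; CM itself only sees the final string.
    """
    parts = [BASE_TAGS]
    for v in variations:
        # Variations in this README are written with a leading underscore.
        parts.append(v if v.startswith("_") else "_" + v)
    return ",".join(parts)

tags = build_tags(["_calib1", "seq-length.256"])
print(tags)
```

The resulting string can then be used as the value of the `'tags'` key in the call above.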
Run this script via Docker (beta)
cm docker script "get dataset preprocessed tokenized squad [variations]"

Variations

  • Group "calibration-set"

    • _calib1
      • ENV variables:
        • CM_DATASET_SQUAD_CALIBRATION_SET: one
    • _calib2
      • ENV variables:
        • CM_DATASET_SQUAD_CALIBRATION_SET: two
    • _no-calib (default)
      • ENV variables:
        • CM_DATASET_SQUAD_CALIBRATION_SET: `` (empty)
  • Group "doc-stride"

    • _doc-stride.#
      • ENV variables:
        • CM_DATASET_DOC_STRIDE: #
    • _doc-stride.128 (default)
      • ENV variables:
        • CM_DATASET_DOC_STRIDE: 128
  • Group "packing"

    • _packed
      • ENV variables:
        • CM_DATASET_SQUAD_PACKED: yes
  • Group "raw"

    • _pickle
      • ENV variables:
        • CM_DATASET_RAW: no
    • _raw (default)
      • ENV variables:
        • CM_DATASET_RAW: yes
  • Group "seq-length"

    • _seq-length.#
      • ENV variables:
        • CM_DATASET_MAX_SEQ_LENGTH: #
    • _seq-length.384 (default)
      • ENV variables:
        • CM_DATASET_MAX_SEQ_LENGTH: 384
Default variations

_doc-stride.128,_no-calib,_raw,_seq-length.384
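
To see how a selection of variations translates into environment variables, the groups listed above can be modeled as a small lookup table. This is only an illustrative sketch: the names `GROUPS`, `WILDCARDS`, `lookup` and `resolve_env` are hypothetical, CM performs its own resolution internally, and the sketch assumes that a variation chosen from a group replaces that group's default.

```python
# Fixed variations from this README: group -> {variation: env variables}.
GROUPS = {
    "calibration-set": {
        "_calib1": {"CM_DATASET_SQUAD_CALIBRATION_SET": "one"},
        "_calib2": {"CM_DATASET_SQUAD_CALIBRATION_SET": "two"},
        "_no-calib": {"CM_DATASET_SQUAD_CALIBRATION_SET": ""},
    },
    "packing": {"_packed": {"CM_DATASET_SQUAD_PACKED": "yes"}},
    "raw": {
        "_pickle": {"CM_DATASET_RAW": "no"},
        "_raw": {"CM_DATASET_RAW": "yes"},
    },
}

# Wildcard variations ("_name.#"): prefix -> (group, env variable set to #).
WILDCARDS = {
    "_doc-stride.": ("doc-stride", "CM_DATASET_DOC_STRIDE"),
    "_seq-length.": ("seq-length", "CM_DATASET_MAX_SEQ_LENGTH"),
}

# Default variations as listed in this README.
DEFAULTS = ["_doc-stride.128", "_no-calib", "_raw", "_seq-length.384"]

def lookup(variation):
    """Return (group, env dict) for a single variation name."""
    for group, members in GROUPS.items():
        if variation in members:
            return group, members[variation]
    for prefix, (group, key) in WILDCARDS.items():
        if variation.startswith(prefix):
            # Everything after the prefix stands in for the "#" wildcard.
            return group, {key: variation[len(prefix):]}
    raise KeyError(f"unknown variation: {variation}")

def resolve_env(selected=()):
    """Merge defaults with selected variations, one choice per group."""
    chosen = {}
    for v in list(DEFAULTS) + list(selected):
        group, env = lookup(v)
        chosen[group] = env  # a later choice in the same group wins
    merged = {}
    for env in chosen.values():
        merged.update(env)
    return merged

print(resolve_env(["_calib1", "_seq-length.256"]))
```

With no arguments, `resolve_env()` returns the environment implied by the default variations alone.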

Native script being run

No run file exists for Windows
Script output

cmr "get dataset preprocessed tokenized squad [variations]"  -j