Weasel serializes `commands` field as string
Description
It looks like Weasel is reading `commands` as a string rather than a list, which causes access to the `name` field of each command to raise `TypeError: string indices must be integers`.
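For context, here is a minimal sketch of the failure mode. The `config` dict below is a hypothetical stand-in for what the project-config loader returns; the point is only that iterating a string yields one-character strings, so indexing with `"name"` fails:

```python
# Hypothetical stand-in for the loaded project config: "commands" arrives
# as a JSON-looking string instead of a parsed list of dicts.
config = {"commands": '[{"name": "download"}]'}

try:
    # Same shape as the dict comprehension in weasel/cli/run.py:81.
    # Iterating a string yields single characters, and indexing a
    # character with a string key raises TypeError.
    commands = {cmd["name"]: cmd for cmd in config.get("commands", [])}
except TypeError as exc:
    print(exc)  # e.g. "string indices must be integers"
```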
Environment
Name      Version   Build            Channel
weasel    0.3.4     py39hca03da5_0
Error
Traceback (most recent call last):

/opt/homebrew/Caskroom/miniconda/base/envs/spacy_dev_pt_core_chat_lg/lib/python3.9/site-packages/weasel/cli/run.py:42 in project_run_cli

      39         print_run_help(project_dir, subcommand, parent_command)
      40     else:
      41         overrides = parse_config_overrides(ctx.args)
  ❱   42         project_run(
      43             project_dir,
      44             subcommand,
      45             overrides=overrides,

  locals:
      ctx = <click.core.Context object at 0x11b83e9a0>
      dry = False
      force = False
      overrides = {
          'vars.experiment': 29,
          'vars.enabled_gazetteers': 'person,address',
          'vars.input_data': 'experiments/028/data/oversampled_merged_dataset.json',
          'vars.address_gazetteer': 'assets/datasets/addresses/pt_br_address-gazetter-2.jsonl'
      }
      parent_command = 'python -m weasel'
      project_dir = PosixPath('.')
      show_help = False
      subcommand = 'experiment'

/opt/homebrew/Caskroom/miniconda/base/envs/spacy_dev_pt_core_chat_lg/lib/python3.9/site-packages/weasel/cli/run.py:81 in project_run

      78     skip_requirements_check (bool): No longer used, deprecated.
      79     """
      80     config = load_project_config(project_dir, overrides=overrides)
  ❱   81     commands = {cmd["name"]: cmd for cmd in config.get("commands", [])}
      82     workflows = config.get("workflows", {})
      83     validate_subcommand(list(commands.keys()), list(workflows.keys()), subcommand)
      84

  locals:
      capture = False
      config = {
          'title': 'NER portuguese chat',
          'description': 'Project tunning NER component in portuguese model using chat corpus',
          'directories': ['assets', 'scripts', 'experiments', 'baseline', 'packages'],
          'assets': [
              {'dest': 'assets/train.json', 'description': 'Training data'},
              {'dest': 'assets/dev.json', 'description': 'Development data'}
          ],
          'commands': '[{"name":"download","help":"Download the pretrained pipeline","script":["python '+5046,
          'env': {},
          'vars': {
              'name': 'core_chat_lg',
              'lang': 'pt',
              'pipeline': 'pt_core_news_lg',
              'version': '0.0.0',
              'dataset': 'raw.json',
              'train': 'train.json',
              'dev': 'dev.json',
              'test': 'test.json',
              'test_data': 'assets/datasets/chats/sample-chats-manual-labeled-test.json',
              'input_data': 'assets/datasets/chats/sample-chats-manual-labeled-train.json',
              ... +9
          },
          'workflows': {
              'experiment': ['fetch-data', 'split-data', 'create-gazetteer', 'convert', 'train', 'evaluate'],
              'experiment_search': ['fetch-data', 'split-data', 'create-gazetteer', 'convert', 'train-search', 'evaluate'],
              'experiment_new': ['setup_experiment', 'create-config']
          }
      }
      dry = False
      force = False
      overrides = {
          'vars.experiment': 29,
          'vars.enabled_gazetteers': 'person,address',
          'vars.input_data': 'experiments/028/data/oversampled_merged_dataset.json',
          'vars.address_gazetteer': 'assets/datasets/addresses/pt_br_address-gazetter-2.jsonl'
      }
      parent_command = 'python -m weasel'
      project_dir = PosixPath('.')
      skip_requirements_check = False
      subcommand = 'experiment'

/opt/homebrew/Caskroom/miniconda/base/envs/spacy_dev_pt_core_chat_lg/lib/python3.9/site-packages/weasel/cli/run.py:81 in <dictcomp>

      78     skip_requirements_check (bool): No longer used, deprecated.
      79     """
      80     config = load_project_config(project_dir, overrides=overrides)
  ❱   81     commands = {cmd["name"]: cmd for cmd in config.get("commands", [])}
      82     workflows = config.get("workflows", {})
      83     validate_subcommand(list(commands.keys()), list(workflows.keys()), subcommand)
      84

  locals:
      .0 = <str_iterator object at 0x107dc13a0>
      cmd = '['
TypeError: string indices must be integers
Thanks.
Could you paste your workflow file? This can't be the right error behaviour no matter what, but I'm trying to figure out whether it's doing this on a workflow that should work, or whether it's just heading down the wrong error path.
My workflow is a bit customized in how it uses variables, but it worked for a while. I wonder whether a version change in my conda environment led to this new behavior.
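One quick way to rule the environment in or out is to print the versions of the packages involved. The package list below is an assumption (I'm guessing at which libraries handle the YAML parsing), so adjust it for your setup:

```python
# Print installed versions of packages that could plausibly affect how the
# project file is parsed. The package names are guesses -- adjust as needed.
from importlib.metadata import PackageNotFoundError, version

for pkg in ("weasel", "srsly", "PyYAML", "confection"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")
```

Comparing this output between a working and a broken environment should show whether a dependency bump is to blame.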
Workflow file:
title: "NER portuguese chat"
description: "Project tunning NER component in portuguese model using chat corpus"
# Variables can be referenced across the project.yml using ${vars.var_name}
vars:
  name: "core_chat_lg"
  lang: "pt"
  pipeline: "pt_core_news_lg"
  version: "0.0.0"
  dataset: "raw.json"
  train: "train.json"
  dev: "dev.json"
  test: "test.json"
  test_data: "assets/datasets/chats/sample-chats-manual-labeled-test.json"
  input_data: "assets/datasets/chats/sample-chats-manual-labeled-train.json"
  experiment: "01"
  train_size: 0.8
  enabled_gazetteers: "null"
  person_gazetteer: "assets/datasets/names_surnames/pt_br_names-gazetteer.jsonl"
  address_gazetteer: "assets/datasets/addresses/pt_br_address-gazetter.jsonl"
  person_entity_ruler_patterns: "null"
  loc_entity_ruler_patterns: "null"
  gazetteers_pattern: "gazetteers_patterns.jsonl"
  # Set your GPU ID, -1 is CPU
  gpu_id: -1
# These are the directories that the project needs. The project CLI will make
# sure that they always exist.
directories: ["assets", "scripts", "experiments", "baseline", "packages"]
# Assets that should be downloaded or available in the directory. We're shipping
# them with the project, so they won't have to be downloaded.
assets:
  - dest: "assets/train.json"
    description: "Training data"
  - dest: "assets/dev.json"
    description: "Development data"
# Workflows are sequences of commands (see below) executed in order. You can
# run them via "spacy project run [workflow]". If a commands's inputs/outputs
# haven't changed, it won't be re-run.
workflows:
  experiment:
    - fetch-data
    - split-data
    - create-gazetteer
    - convert
    - train
    - evaluate
  experiment_search:
    - fetch-data
    - split-data
    - create-gazetteer
    - convert
    - train-search
    - evaluate
  experiment_new:
    - setup_experiment
    - create-config
# Project commands, specified in a style similar to CI config files (e.g. Azure
# pipelines). The name is the command name that lets you trigger the command
# via "spacy project run [command] [path]". The help message is optional and
# shown when executing "spacy project run [optional command] [path] --help".
commands:
  - name: "download"
    help: "Download the pretrained pipeline"
    script:
      - "python -m spacy download ${vars.pipeline}"
  - name: "setup_experiment"
    help: "Setup experiment directory structure"
    script:
      - "mkdir -p experiments/0${vars.experiment}/data experiments/0${vars.experiment}/configs experiments/0${vars.experiment}/training experiments/0${vars.experiment}/corpus experiments/0${vars.experiment}/scripts"
      - "touch experiments/0${vars.experiment}/README.md"
  - name: "create-config"
    help: "Create a config for updating only NER from an existing pipeline"
    script:
      - "python scripts/create_config.py ${vars.pipeline} ner experiments/0${vars.experiment}/data/${vars.gazetteers_pattern} ${vars.enabled_gazetteers} experiments/0${vars.experiment}/configs/config.cfg"
    deps:
      - "scripts/create_config.py"
    outputs:
      - "experiments/0${vars.experiment}/configs/config.cfg"
  - name: "fetch-data"
    help: "Fetch the training and test data"
    script:
      - "cp ${vars.input_data} experiments/0${vars.experiment}/data/${vars.dataset}"
      - "cp ${vars.test_data} experiments/0${vars.experiment}/data/${vars.test}"
    deps:
      - "${vars.input_data}"
      - "${vars.test_data}"
    outputs:
      - "experiments/0${vars.experiment}/data/${vars.dataset}"
      - "experiments/0${vars.experiment}/data/${vars.test}"
  - name: "split-data"
    help: "Split the data into training and eval sets, and copy the test data"
    script:
      - "python scripts/split_train_test.py experiments/0${vars.experiment}/data/${vars.dataset} ${vars.train_size} experiments/0${vars.experiment}/data/${vars.train} experiments/0${vars.experiment}/data/${vars.dev}"
    deps:
      - "experiments/0${vars.experiment}/data/${vars.dataset}"
      - "scripts/split_train_test.py"
    outputs:
      - "experiments/0${vars.experiment}/data/${vars.train}"
      - "experiments/0${vars.experiment}/data/${vars.dev}"
  - name: "create-gazetteer"
    help: "Merge gazetter into single pattern file"
    script:
      - "python scripts/merge_gazetters.py ${vars.enabled_gazetteers} ${vars.person_gazetteer} ${vars.address_gazetteer} experiments/0${vars.experiment}/data/${vars.gazetteers_pattern}"
    deps:
      - "${vars.person_gazetteer}"
      - "${vars.address_gazetteer}"
      - "scripts/merge_gazetters.py"
    outputs:
      - "experiments/0${vars.experiment}/data/${vars.gazetteers_pattern}"
  - name: "convert"
    help: "Convert the data to spaCy's binary format"
    script:
      - "mkdir -p experiments/0${vars.experiment}/corpus"
      - "python scripts/convert.py ${vars.lang} experiments/0${vars.experiment}/data/${vars.train} experiments/0${vars.experiment}/corpus/train.spacy"
      - "python scripts/convert.py ${vars.lang} experiments/0${vars.experiment}/data/${vars.dev} experiments/0${vars.experiment}/corpus/dev.spacy"
      - "python scripts/convert.py ${vars.lang} experiments/0${vars.experiment}/data/${vars.test} experiments/0${vars.experiment}/corpus/test.spacy"
    deps:
      - "experiments/0${vars.experiment}/data/${vars.train}"
      - "experiments/0${vars.experiment}/data/${vars.dev}"
      - "experiments/0${vars.experiment}/data/${vars.test}"
      - "scripts/convert.py"
    outputs:
      - "experiments/0${vars.experiment}/corpus/train.spacy"
      - "experiments/0${vars.experiment}/corpus/dev.spacy"
      - "experiments/0${vars.experiment}/corpus/test.spacy"
  - name: "train"
    help: "Update the NER model"
    script:
      - "mkdir -p experiments/0${vars.experiment}/training"
      - "python -m spacy train experiments/0${vars.experiment}/configs/config.cfg --output experiments/0${vars.experiment}/training/ --paths.entity_ruler_patterns experiments/0${vars.experiment}/data/${vars.gazetteers_pattern} --paths.person_entity_ruler_patterns ${vars.person_entity_ruler_patterns} --paths.loc_entity_ruler_patterns ${vars.loc_entity_ruler_patterns} --paths.train experiments/0${vars.experiment}/corpus/train.spacy --paths.dev experiments/0${vars.experiment}/corpus/dev.spacy --gpu-id ${vars.gpu_id}"
    deps:
      - "experiments/0${vars.experiment}/configs/config.cfg"
      - "experiments/0${vars.experiment}/corpus/train.spacy"
      - "experiments/0${vars.experiment}/corpus/dev.spacy"
    outputs:
      - "experiments/0${vars.experiment}/training/model-best"
  - name: "train-search"
    help: "Run customized training runs for hyperparameter search using [Weights & Biases Sweeps](https://docs.wandb.ai/guides/sweeps)"
    script:
      - "mkdir -p experiments/0${vars.experiment}/training"
      - "python scripts/train/wandb_sweeps.py experiments/0${vars.experiment}/configs/config.cfg experiments/0${vars.experiment}/training/ experiments/0${vars.experiment}/corpus/train.spacy experiments/0${vars.experiment}/corpus/dev.spacy experiments/0${vars.experiment}/corpus/train.spacy --gazetteer-path experiments/0${vars.experiment}/data/${vars.gazetteers_pattern}"
    deps:
      - "scripts/train/wandb_sweeps.py"
      - "experiments/0${vars.experiment}/configs/config.cfg"
      - "experiments/0${vars.experiment}/corpus/train.spacy"
      - "experiments/0${vars.experiment}/corpus/dev.spacy"
    outputs:
      - "experiments/0${vars.experiment}/training/model-best"
  - name: "evaluate"
    help: "Evaluate the model and export metrics"
    script:
      - "python -m spacy evaluate experiments/0${vars.experiment}/training/model-best experiments/0${vars.experiment}/corpus/test.spacy --output experiments/0${vars.experiment}/metrics.json"
    deps:
      - "experiments/0${vars.experiment}/corpus/test.spacy"
      - "experiments/0${vars.experiment}/training/model-best"
    outputs:
      - "experiments/0${vars.experiment}/metrics.json"
  - name: package
    help: "Package the trained model as a pip package"
    script:
      - "python -m spacy package experiments/0${vars.experiment}/training/model-best packages --name ${vars.name} --version ${vars.version} --force"
    deps:
      - "experiments/0${vars.experiment}/training/model-best"
    outputs_no_cache:
      - "packages/${vars.lang}_${vars.name}-${vars.version}/dist/${vars.lang}_${vars.name}-${vars.version}.tar.gz"
  - name: visualize-model
    help: Visualize the model's output interactively using Streamlit
    # https://github.com/explosion/spacy-streamlit/issues/55
    script:
      - 'python -m streamlit run scripts/visualize_model.py experiments/0${vars.experiment}/training/model-best "AUTOMATION: Não aceite cobrança na entrega se o pedido foi pago pelo app e nunca compartilhe dados pessoais em conversas de chat ou telefone.'
    deps:
      - "scripts/visualize_model.py"
      - "experiments/0${vars.experiment}/training/model-best"
Is the indentation right in 'commands' (maybe it's just a paste thing)? I'd have a quick look at how the file parses in a yaml-to-json converter, just to see if there's some stupid yaml whitespace thing.
I've tried adding indentation, but the error persists. Per the YAML spec, block lists can be declared with or without indentation under a mapping key.
Try the following YAML at https://onlineyamltools.com/convert-yaml-to-json
list:
- one
- two
And the output will be:
{
  "list": [
    "one",
    "two"
  ]
}
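The same check can be run locally with PyYAML (a third-party library, assumed installed here; it may not be the parser Weasel uses internally). Both indentation styles parse to the same list, so the flush-left style on its own shouldn't turn `commands` into a string:

```python
import yaml  # third-party: pip install pyyaml

indented = "list:\n  - one\n  - two\n"
flush = "list:\n- one\n- two\n"

# Block sequences may sit at the same indentation level as their mapping
# key, so both documents parse to the same Python structure.
assert yaml.safe_load(indented) == yaml.safe_load(flush) == {"list": ["one", "two"]}
print("both styles parse to", yaml.safe_load(flush))
```

Loading the actual project file the same way (`yaml.safe_load(open("project.yml"))`) and printing `type(config["commands"])` would show whether the string already comes out of the YAML parser or is introduced later, e.g. during variable interpolation.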