Problems a beginner ran into while learning the Docker toolchain

陳蔚禮 · May 2024

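Hello, my company is trying out your platform. We downloaded the toolchain Docker environment; Docker is set up and the toolchain image has been pulled. I copied the entire yolov5 folder under ai-training/detection/ to my native Ubuntu machine, installed requirement.txt and the CUDA driver as instructed, and then ran the commands from the README.MD in the tutorial, but an error occurred at the training step.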

I ran the following command:

$ CUDA_VISIBLE_DEVICES='0' python train.py --data coco128.yaml --cfg yolov5s-noupsample.yaml --weights 'best.pt' --batch-size 2 --epoch 2

The output was:

Using torch 2.0.1+cu117 CUDA:0 (NVIDIA GeForce GTX 1060 3GB, 3005MB)

Namespace(adam=False, batch_size=2, bucket='', cache_images=False, cfg='./models/yolov5s-noupsample.yaml', data='./data/coco128.yaml', device='', epochs=2, evolve=False, exist_ok=False, global_rank=-1, hyp='data/hyp.scratch.yaml', image_weights=False, img_size=[640, 640], local_rank=-1, log_imgs=16, multi_scale=False, name='exp', noautoanchor=False, nosave=False, notest=False, project='runs/train', rect=False, resume=False, save_dir='runs/train/exp2', single_cls=False, sync_bn=False, total_batch_size=2, weights='best.pt', workers=8, world_size=1)

Start Tensorboard with "tensorboard --logdir runs/train", view at http://localhost:6006/

Hyperparameters {'lr0': 0.01, 'lrf': 0.2, 'momentum': 0.937, 'weight_decay': 0.0005, 'warmup_epochs': 3.0, 'warmup_momentum': 0.8, 'warmup_bias_lr': 0.1, 'box': 0.05, 'cls': 0.5, 'cls_pw': 1.0, 'obj': 1.0, 'obj_pw': 1.0, 'iou_t': 0.2, 'anchor_t': 4.0, 'fl_gamma': 0.0, 'hsv_h': 0.015, 'hsv_s': 0.7, 'hsv_v': 0.4, 'degrees': 0.0, 'translate': 0.1, 'scale': 0.5, 'shear': 0.0, 'perspective': 0.0, 'flipud': 0.0, 'fliplr': 0.5, 'mosaic': 1.0, 'mixup': 0.0}

from n params module arguments

0 -1 1 3520 models.common.Focus [3, 32, 3]

1 -1 1 18560 models.common.Conv [32, 64, 3, 2]

2 -1 1 19904 models.common.BottleneckCSP [64, 64, 1]

3 -1 1 73984 models.common.Conv [64, 128, 3, 2]

4 -1 1 161152 models.common.BottleneckCSP [128, 128, 3]

5 -1 1 295424 models.common.Conv [128, 256, 3, 2]

6 -1 1 641792 models.common.BottleneckCSP [256, 256, 3]

7 -1 1 1180672 models.common.Conv [256, 512, 3, 2]

8 -1 1 656896 models.common.SPP [512, 512, [5, 9, 13]]

9 -1 1 1248768 models.common.BottleneckCSP [512, 512, 1, False]

10 4 1 147712 models.common.Conv [128, 128, 3, 1]

11 6 1 590336 models.common.Conv [256, 256, 3, 1]

12 [7, 9] 1 0 models.common.Concat [1]

13 -1 1 1510912 models.common.BottleneckCSP [1024, 512, 1, False]

14 [10, 11, 13] 1 229245 models.yolo.Detect [80, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]

[W NNPACK.cpp:64] Could not initialize NNPACK! Reason: Unsupported hardware.

/home/willy/anaconda3/envs/k-yolov5/lib/python3.8/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3483.)

return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]

Model Summary: 201 layers, 6778877 parameters, 6778877 gradients, 17.0 GFLOPS

Transferred 263/265 items from best.pt

Optimizer groups: 45 .bias, 50 conv.weight, 42 other

*image_path_i ../coco128/images/train2017/classes.png

Traceback (most recent call last):

File "/ai-data/kneron-yolov5/yolov5/utils/datasets.py", line 377, in __init__

raise ValueError('error: not find',image_path_i)

ValueError: ('error: not find', '../coco128/images/train2017/classes.png')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

File "train.py", line 491, in <module>

train(hyp, opt, device, tb_writer, wandb)

File "train.py", line 183, in train

dataloader, dataset = create_dataloader(train_path, imgsz, batch_size, gs, opt,

File "/ai-data/kneron-yolov5/yolov5/utils/datasets.py", line 59, in create_dataloader

dataset = LoadImagesAndLabels(path, imgsz, batch_size,

File "/ai-data/kneron-yolov5/yolov5/utils/datasets.py", line 396, in __init__

raise Exception('Error loading data from %s: %s\nSee %s' % (path, e, help_url))

Exception: Error loading data from ../coco128/images/train2017/: ('error: not find', '../coco128/images/train2017/classes.png')

See https://github.com/ultralytics/yolov5/wiki/Train-Custom-Data

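The dataset is coco128, located in the same directory level as yolov5, with the structure described in the README.

Why does it say classes.png cannot be found? coco128 does not contain that image, so why is it looking for it?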

陳蔚禮 · May 2024

Problem solved. It turned out that labelImg had left behind a classes.txt file during labeling; deleting it fixed the error.
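For anyone hitting the same error, here is a rough sketch of how such stray files could be spotted before training. It assumes the loader derives an image path from every .txt file it sees, which is what the traceback suggests but is not confirmed. The function name `find_orphan_txt` and the list of image suffixes are made up for illustration; the paths are taken from the log above.

```python
from pathlib import Path

# Image extensions treated as valid matches (an assumption; extend as needed).
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp"}


def find_orphan_txt(images_dir, txt_dirs):
    """Return .txt files in txt_dirs whose stem matches no image in images_dir."""
    image_stems = {
        p.stem
        for p in Path(images_dir).iterdir()
        if p.suffix.lower() in IMAGE_SUFFIXES
    }
    orphans = []
    for directory in txt_dirs:
        for txt in Path(directory).glob("*.txt"):
            if txt.stem not in image_stems:
                orphans.append(txt)
    return sorted(orphans)


if __name__ == "__main__":
    # Paths assumed from the training log; adjust to your own layout.
    images = Path("../coco128/images/train2017")
    labels = Path("../coco128/labels/train2017")
    if images.is_dir() and labels.is_dir():
        for orphan in find_orphan_txt(images, [labels, images]):
            print("no matching image:", orphan)
```

A classes.txt left by labelImg would show up in this list, since no image shares its name.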

陳蔚禮 · May 2024

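After continuing, another problem appeared:

RuntimeError: result type Float can't be cast to the desired output type long int

I resolved it by modifying loss.py, following a fix I found by searching online.

So the image pulled from Docker is not the final, complete version, is that right?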

陳蔚禮 · May 2024

Continuing further to the generate_npy step, I ran

python yolov5_generate_npy.py

and got the following message:

/home/willy/anaconda3/envs/yolov5/lib/python3.8/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3483.)

return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]

The three npy files were generated, but is this result correct?

陳蔚禮 · May 2024

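It looks like all of these problems came from the PyTorch version being too new. After recreating the conda environment with torch 1.7, almost all of the earlier problems went away. When converting to ONNX, a similar problem appears because protobuf is too new; it also has to be downgraded before the conversion succeeds.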
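A small check like the one below could be run inside the conda environment before starting, to confirm the pinned versions are actually the ones installed. This is only a sketch: the helper names are hypothetical, and the only version stated in this thread is torch 1.7. No protobuf version was given, so it is left out of the defaults and would have to be added by whoever knows the working one.

```python
from importlib import metadata

# Expected version prefixes. Only torch 1.7 comes from this thread;
# add other packages (e.g. protobuf) once a working version is known.
EXPECTED_PREFIXES = {"torch": "1.7."}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_matches(installed, prefix):
    """True when a version is installed and starts with the expected prefix."""
    return installed is not None and installed.startswith(prefix)


def report(expected=EXPECTED_PREFIXES):
    """Build one status line per package in the expected mapping."""
    lines = []
    for package, prefix in expected.items():
        found = installed_version(package)
        status = "ok" if version_matches(found, prefix) else "MISMATCH"
        lines.append(f"{package}: want {prefix}*, found {found} [{status}]")
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```

With torch 2.0.1+cu117 installed, as in the first log, the torch line would be flagged as a mismatch.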

Maria Chen · May 2024

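Hello,

Yes, this example is old. To use it, you need to match the versions of PyTorch and the other dependencies it was originally written for.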

陳蔚禮 · May 2024

I finally reached the last step:

# Evaluation

python test.py --weights runs/train/exp/weights/best.pt --verbose

Result:

Namespace(augment=False, batch_size=32, conf_thres=0.001, data='data/coco128.yaml', device='cpu', exist_ok=False, img_size=640, iou_thres=0.65, name='exp', project='runs/test', save_conf=False, save_json=False, save_txt=False, single_cls=False, task='val', verbose=True, weights=['runs/train/exp/weights/best.pt'])

Using torch 1.7.0 CPU

Fusing layers...

[W NNPACK.cpp:80] Could not initialize NNPACK! Reason: Unsupported hardware.

Model Summary: 164 layers, 6772285 parameters, 0 gradients, 16.8 GFLOPS

***cache_path ../coco128/labels/train2017.cache

Scanning labels ../coco128/labels/train2017.cache (126 found, 0 missing, 2 empty, 0 duplicate, for 128 images): 128it [00:00, 4028.78it/s]

Class Images Targets P R mAP@.5 mAP@.5:.95: 25%|▎| 1/4 [01:09<03:28, 69.61s/iKilled

No result! Why is that? What does "Reason: Unsupported hardware." refer to? Does this step have to be run on an actual Kneron hardware platform?

陳蔚禮 · May 2024

In the end, adding --device 0 produced the output.
