Beginner running into problems with the docker toolchain

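Hello, my company is trying out your platform. I downloaded the toolchain docker environment; docker is set up and the toolchain image has been pulled. I copied the entire yolov5 folder under ai-training/detection/ to my native Ubuntu machine, installed requirement.txt and the CUDA driver as instructed, and then ran the commands from the README.MD in the tutorial, but an error occurred at the training step.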

I ran the following command:

$ CUDA_VISIBLE_DEVICES='0' python train.py --data coco128.yaml --cfg yolov5s-noupsample.yaml --weights 'best.pt' --batch-size 2 --epoch 2

The output was:

Using torch 2.0.1+cu117 CUDA:0 (NVIDIA GeForce GTX 1060 3GB, 3005MB)

Namespace(adam=False, batch_size=2, bucket='', cache_images=False, cfg='./models/yolov5s-noupsample.yaml', data='./data/coco128.yaml', device='', epochs=2, evolve=False, exist_ok=False, global_rank=-1, hyp='data/hyp.scratch.yaml', image_weights=False, img_size=[640, 640], local_rank=-1, log_imgs=16, multi_scale=False, name='exp', noautoanchor=False, nosave=False, notest=False, project='runs/train', rect=False, resume=False, save_dir='runs/train/exp2', single_cls=False, sync_bn=False, total_batch_size=2, weights='best.pt', workers=8, world_size=1)

Start Tensorboard with "tensorboard --logdir runs/train", view at http://localhost:6006/

Hyperparameters {'lr0': 0.01, 'lrf': 0.2, 'momentum': 0.937, 'weight_decay': 0.0005, 'warmup_epochs': 3.0, 'warmup_momentum': 0.8, 'warmup_bias_lr': 0.1, 'box': 0.05, 'cls': 0.5, 'cls_pw': 1.0, 'obj': 1.0, 'obj_pw': 1.0, 'iou_t': 0.2, 'anchor_t': 4.0, 'fl_gamma': 0.0, 'hsv_h': 0.015, 'hsv_s': 0.7, 'hsv_v': 0.4, 'degrees': 0.0, 'translate': 0.1, 'scale': 0.5, 'shear': 0.0, 'perspective': 0.0, 'flipud': 0.0, 'fliplr': 0.5, 'mosaic': 1.0, 'mixup': 0.0}


                from n   params module                                 arguments                    

 0               -1 1     3520 models.common.Focus                    [3, 32, 3]                   

 1               -1 1    18560 models.common.Conv                     [32, 64, 3, 2]               

 2               -1 1    19904 models.common.BottleneckCSP            [64, 64, 1]                  

 3               -1 1    73984 models.common.Conv                     [64, 128, 3, 2]              

 4               -1 1   161152 models.common.BottleneckCSP            [128, 128, 3]                

 5               -1 1   295424 models.common.Conv                     [128, 256, 3, 2]             

 6               -1 1   641792 models.common.BottleneckCSP            [256, 256, 3]                

 7               -1 1  1180672 models.common.Conv                     [256, 512, 3, 2]             

 8               -1 1   656896 models.common.SPP                      [512, 512, [5, 9, 13]]       

 9               -1 1  1248768 models.common.BottleneckCSP            [512, 512, 1, False]         

 10                4 1   147712 models.common.Conv                     [128, 128, 3, 1]             

 11                6 1   590336 models.common.Conv                     [256, 256, 3, 1]             

 12           [7, 9] 1        0 models.common.Concat                   [1]                          

 13               -1 1  1510912 models.common.BottleneckCSP            [1024, 512, 1, False]        

 14     [10, 11, 13] 1   229245 models.yolo.Detect                     [80, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]

[W NNPACK.cpp:64] Could not initialize NNPACK! Reason: Unsupported hardware.

/home/willy/anaconda3/envs/k-yolov5/lib/python3.8/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3483.)

 return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]

Model Summary: 201 layers, 6778877 parameters, 6778877 gradients, 17.0 GFLOPS


Transferred 263/265 items from best.pt

Optimizer groups: 45 .bias, 50 conv.weight, 42 other

*image_path_i ../coco128/images/train2017/classes.png

Traceback (most recent call last):

 File "/ai-data/kneron-yolov5/yolov5/utils/datasets.py", line 377, in __init__

   raise ValueError('error: not find',image_path_i)

ValueError: ('error: not find', '../coco128/images/train2017/classes.png')


During handling of the above exception, another exception occurred:


Traceback (most recent call last):

 File "train.py", line 491, in <module>

   train(hyp, opt, device, tb_writer, wandb)

 File "train.py", line 183, in train

   dataloader, dataset = create_dataloader(train_path, imgsz, batch_size, gs, opt,

 File "/ai-data/kneron-yolov5/yolov5/utils/datasets.py", line 59, in create_dataloader

   dataset = LoadImagesAndLabels(path, imgsz, batch_size,

 File "/ai-data/kneron-yolov5/yolov5/utils/datasets.py", line 396, in __init__

   raise Exception('Error loading data from %s: %s\nSee %s' % (path, e, help_url))

Exception: Error loading data from ../coco128/images/train2017/: ('error: not find', '../coco128/images/train2017/classes.png')

See https://github.com/ultralytics/yolov5/wiki/Train-Custom-Data

The dataset is coco128. It sits at the same directory level as yolov5, with the same structure described in README.md.

Why does it say it cannot find classes.png? There is no such image in coco128, so why is it looking for it?
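An error that names an image which does not exist often points to a stray file somewhere in the dataset folders. The following is a minimal diagnostic sketch, not part of the Kneron example. It assumes the layout seen in the log above (`../coco128/images/train2017` and `../coco128/labels/train2017`) and assumes the loader may derive image paths from label file names, which this thread does not confirm. The function names and the extension list are hypothetical.

```python
from pathlib import Path

# Image extensions to accept; an assumption, adjust to what your loader supports.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".bmp"}


def find_orphan_labels(images_dir, labels_dir):
    """Return label .txt files that have no image with the same stem."""
    images_dir, labels_dir = Path(images_dir), Path(labels_dir)
    image_stems = {
        p.stem for p in images_dir.iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_EXTS
    }
    return sorted(
        p for p in labels_dir.glob("*.txt") if p.stem not in image_stems
    )


def find_non_images(images_dir):
    """Return files in the images folder that are not images."""
    return sorted(
        p for p in Path(images_dir).iterdir()
        if p.is_file() and p.suffix.lower() not in IMAGE_EXTS
    )


if __name__ == "__main__":
    # Paths taken from the log output; change them if your layout differs.
    images = Path("../coco128/images/train2017")
    labels = Path("../coco128/labels/train2017")
    if images.is_dir() and labels.is_dir():
        for p in find_orphan_labels(images, labels):
            print("label without image:", p)
        for p in find_non_images(images):
            print("non-image file in images folder:", p)
    else:
        print("dataset folders not found; adjust the paths above")
```

Run from the yolov5 directory, any file it prints is worth inspecting before training again.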

Comments

  • Problem solved. It turned out that labeling with labelImg had left behind a classes.txt file; deleting it fixed the error.

  • After continuing, another problem appeared:

    RuntimeError: result type Float can't be cast to the desired output type long int

    I later fixed it by modifying loss.py, following a solution I found online.

    So the image pulled from docker is not the final, complete version, is it?

  • Going further, at the generate_npy step, running

    python yolov5_generate_npy.py

    produces the following message:

    /home/willy/anaconda3/envs/yolov5/lib/python3.8/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3483.)

     return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]

    The three npy files were generated, but is the result correct?

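  • It looks like all of these problems came from the PyTorch version being too new. After recreating the conda environment with torch 1.7, almost all of the earlier problems disappeared. When converting to ONNX, a similar problem came up because protobuf was too new; it also had to be downgraded before the conversion succeeded.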

  • Hello,

    Yes, this example is old. To use it, you need to match the older versions of PyTorch and the other dependencies.

  • I finally reached the last step:

    # Evaluation

    python test.py --weights runs/train/exp/weights/best.pt --verbose

    Result:

    Namespace(augment=False, batch_size=32, conf_thres=0.001, data='data/coco128.yaml', device='cpu', exist_ok=False, img_size=640, iou_thres=0.65, name='exp', project='runs/test', save_conf=False, save_json=False, save_txt=False, single_cls=False, task='val', verbose=True, weights=['runs/train/exp/weights/best.pt'])

    Using torch 1.7.0 CPU


    Fusing layers...

    [W NNPACK.cpp:80] Could not initialize NNPACK! Reason: Unsupported hardware.

    Model Summary: 164 layers, 6772285 parameters, 0 gradients, 16.8 GFLOPS

    ***cache_path ../coco128/labels/train2017.cache

    Scanning labels ../coco128/labels/train2017.cache (126 found, 0 missing, 2 empty, 0 duplicate, for 128 images): 128it [00:00, 4028.78it/s]

                  Class     Images    Targets          P          R     mAP@.5 mAP@.5:.95: 25%|▎| 1/4 [01:09<03:28, 69.61s/iKilled


    No result! Why is that? What does "Reason: Unsupported hardware." refer to? Does this step need to run on an actual Kneron hardware platform?

  • In the end, adding --device 0 produced the output.
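Since most of the failures in this thread came down to packages being newer than the example expects, a small pre-flight check can flag a mismatch before training starts. This is a minimal sketch using only the standard library (Python 3.8+). The `check_pins` and `parse_version` helpers are hypothetical names, and the only pin shown is torch 1.7, the version reported to work above. The thread does not say which protobuf version worked, so it is left out.

```python
from importlib import metadata


def parse_version(text):
    """Turn '2.0.1+cu117' into (2, 0, 1); a non-numeric part ends the parse."""
    parts = []
    for piece in text.split("+")[0].split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def check_pins(pins, get_version=metadata.version):
    """Return one message per package whose installed version does not
    start with the pinned prefix (e.g. pin '1.7' accepts '1.7.0')."""
    problems = []
    for name, wanted in pins.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: not installed (expected {wanted}.x)")
            continue
        want = parse_version(wanted)
        have = parse_version(installed)
        if have[:len(want)] != want:
            problems.append(f"{name}: found {installed}, expected {wanted}.x")
    return problems


if __name__ == "__main__":
    # torch 1.7 is the version reported to work in this thread.
    PINS = {"torch": "1.7"}
    for line in check_pins(PINS) or ["all pinned packages match"]:
        print(line)
```

With the environment from the original post (torch 2.0.1+cu117), this would report a torch mismatch; other pins can be added to the dictionary once the working versions are known.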

The discussion has been closed due to inactivity. To continue with the topic, please feel free to post a new discussion.