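Beginner running into problems with the Docker toolchain
Hello, my company is trying out your platform. I downloaded the toolchain Docker environment; Docker is set up and I have pulled the toolchain image. I copied the whole yolov5 directory under ai-training/detection/ to my native Ubuntu host, installed requirement.txt and the CUDA driver as instructed, and then ran the commands from the README.MD in the tutorial. The training step fails with an error.
I ran the following command:
$ CUDA_VISIBLE_DEVICES='0' python train.py --data coco128.yaml --cfg yolov5s-noupsample.yaml --weights 'best.pt' --batch-size 2 --epoch 2
The output was: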
Using torch 2.0.1+cu117 CUDA:0 (NVIDIA GeForce GTX 1060 3GB, 3005MB)
Namespace(adam=False, batch_size=2, bucket='', cache_images=False, cfg='./models/yolov5s-noupsample.yaml', data='./data/coco128.yaml', device='', epochs=2, evolve=False, exist_ok=False, global_rank=-1, hyp='data/hyp.scratch.yaml', image_weights=False, img_size=[640, 640], local_rank=-1, log_imgs=16, multi_scale=False, name='exp', noautoanchor=False, nosave=False, notest=False, project='runs/train', rect=False, resume=False, save_dir='runs/train/exp2', single_cls=False, sync_bn=False, total_batch_size=2, weights='best.pt', workers=8, world_size=1)
Start Tensorboard with "tensorboard --logdir runs/train", view at http://localhost:6006/
Hyperparameters {'lr0': 0.01, 'lrf': 0.2, 'momentum': 0.937, 'weight_decay': 0.0005, 'warmup_epochs': 3.0, 'warmup_momentum': 0.8, 'warmup_bias_lr': 0.1, 'box': 0.05, 'cls': 0.5, 'cls_pw': 1.0, 'obj': 1.0, 'obj_pw': 1.0, 'iou_t': 0.2, 'anchor_t': 4.0, 'fl_gamma': 0.0, 'hsv_h': 0.015, 'hsv_s': 0.7, 'hsv_v': 0.4, 'degrees': 0.0, 'translate': 0.1, 'scale': 0.5, 'shear': 0.0, 'perspective': 0.0, 'flipud': 0.0, 'fliplr': 0.5, 'mosaic': 1.0, 'mixup': 0.0}
from n params module arguments
0 -1 1 3520 models.common.Focus [3, 32, 3]
1 -1 1 18560 models.common.Conv [32, 64, 3, 2]
2 -1 1 19904 models.common.BottleneckCSP [64, 64, 1]
3 -1 1 73984 models.common.Conv [64, 128, 3, 2]
4 -1 1 161152 models.common.BottleneckCSP [128, 128, 3]
5 -1 1 295424 models.common.Conv [128, 256, 3, 2]
6 -1 1 641792 models.common.BottleneckCSP [256, 256, 3]
7 -1 1 1180672 models.common.Conv [256, 512, 3, 2]
8 -1 1 656896 models.common.SPP [512, 512, [5, 9, 13]]
9 -1 1 1248768 models.common.BottleneckCSP [512, 512, 1, False]
10 4 1 147712 models.common.Conv [128, 128, 3, 1]
11 6 1 590336 models.common.Conv [256, 256, 3, 1]
12 [7, 9] 1 0 models.common.Concat [1]
13 -1 1 1510912 models.common.BottleneckCSP [1024, 512, 1, False]
14 [10, 11, 13] 1 229245 models.yolo.Detect [80, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
[W NNPACK.cpp:64] Could not initialize NNPACK! Reason: Unsupported hardware.
/home/willy/anaconda3/envs/k-yolov5/lib/python3.8/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3483.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
Model Summary: 201 layers, 6778877 parameters, 6778877 gradients, 17.0 GFLOPS
Transferred 263/265 items from best.pt
Optimizer groups: 45 .bias, 50 conv.weight, 42 other
*image_path_i ../coco128/images/train2017/classes.png
Traceback (most recent call last):
File "/ai-data/kneron-yolov5/yolov5/utils/datasets.py", line 377, in __init__
raise ValueError('error: not find',image_path_i)
ValueError: ('error: not find', '../coco128/images/train2017/classes.png')
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "train.py", line 491, in <module>
train(hyp, opt, device, tb_writer, wandb)
File "train.py", line 183, in train
dataloader, dataset = create_dataloader(train_path, imgsz, batch_size, gs, opt,
File "/ai-data/kneron-yolov5/yolov5/utils/datasets.py", line 59, in create_dataloader
dataset = LoadImagesAndLabels(path, imgsz, batch_size,
File "/ai-data/kneron-yolov5/yolov5/utils/datasets.py", line 396, in __init__
raise Exception('Error loading data from %s: %s\nSee %s' % (path, e, help_url))
Exception: Error loading data from ../coco128/images/train2017/: ('error: not find', '../coco128/images/train2017/classes.png')
See https://github.com/ultralytics/yolov5/wiki/Train-Custom-Data
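The dataset is coco128. It sits at the same directory level as yolov5, matching the structure described in the README.
Why does it say classes.png cannot be found? coco128 does not contain that image, so why is the loader looking for it?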
Comments
The problem is solved. It turned out that labelImg had left a classes.txt file behind when I was annotating; deleting it fixed the error.
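For anyone hitting the same error, here is a minimal sketch of how such a stray file could be spotted before training. It assumes the labels live in a directory of .txt files (for example ../coco128/labels/train2017, as the cache path in the logs suggests) and that every label should share its file stem with an image. The function name and the list of image suffixes are my own choices for illustration, not part of the Kneron or YOLOv5 code.

```python
from pathlib import Path

# Image extensions to consider; an assumption for illustration only.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp"}


def find_orphan_labels(labels_dir, images_dir):
    """Return label files whose stem matches no image in images_dir."""
    image_stems = {
        p.stem
        for p in Path(images_dir).iterdir()
        if p.suffix.lower() in IMAGE_SUFFIXES
    }
    # A file such as classes.txt has no image called classes.*, so it is listed.
    return sorted(
        p for p in Path(labels_dir).glob("*.txt") if p.stem not in image_stems
    )
```

Calling find_orphan_labels("../coco128/labels/train2017", "../coco128/images/train2017") and printing the result should list classes.txt if it is still there.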
After that, training got further and then hit another error:
RuntimeError: result type Float can't be cast to the desired output type long int
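I worked around this by modifying loss.py, following a fix I found online.
So the image pulled from Docker is not the final, complete version, is that right?
Continuing on to the generate_npy step, I ran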
python yolov5_generate_npy.py
and got the following message:
/home/willy/anaconda3/envs/yolov5/lib/python3.8/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3483.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
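The three .npy files were generated, but is the result correct?
It looks like all of these problems came from the PyTorch version being too new. I created a fresh conda environment with torch 1.7, and almost all of the earlier problems went away. The ONNX conversion step then failed because protobuf was too new; that also had to be downgraded before the conversion succeeded.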
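Since the fixes here all came down to package versions, a quick check of the environment before running anything may save time. Below is a minimal sketch using only the standard library (importlib.metadata, available from Python 3.8). The function name check_versions and the idea of matching on a version prefix are my own assumptions; the only version confirmed in this thread is torch 1.7, so any protobuf pin has to come from the example's own requirements.

```python
from importlib import metadata


def _matches(installed, prefix):
    """True if installed equals prefix or continues it after a dot (1.7 -> 1.7.0)."""
    return installed == prefix or installed.startswith(prefix + ".")


def check_versions(expected, get_version=metadata.version):
    """Return {package: (installed, wanted_prefix)} for every mismatch.

    expected maps a distribution name to the version prefix you want.
    A package that is not installed is reported with installed=None.
    """
    mismatches = {}
    for name, prefix in expected.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed is None or not _matches(installed, prefix):
            mismatches[name] = (installed, prefix)
    return mismatches
```

For example, check_versions({"torch": "1.7"}) returns an empty dict when the active environment has a 1.7.x build of torch, and otherwise reports what is installed instead.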
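Hello,
Yes, this example is old. To use it, you need to match the PyTorch version (and the versions of the other packages) it was originally written against.
I finally reached the last step: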
# Evaluation
python test.py --weights runs/train/exp/weights/best.pt --verbose
Result:
Namespace(augment=False, batch_size=32, conf_thres=0.001, data='data/coco128.yaml', device='cpu', exist_ok=False, img_size=640, iou_thres=0.65, name='exp', project='runs/test', save_conf=False, save_json=False, save_txt=False, single_cls=False, task='val', verbose=True, weights=['runs/train/exp/weights/best.pt'])
Using torch 1.7.0 CPU
Fusing layers...
[W NNPACK.cpp:80] Could not initialize NNPACK! Reason: Unsupported hardware.
Model Summary: 164 layers, 6772285 parameters, 0 gradients, 16.8 GFLOPS
***cache_path ../coco128/labels/train2017.cache
Scanning labels ../coco128/labels/train2017.cache (126 found, 0 missing, 2 empty, 0 duplicate, for 128 images): 128it [00:00, 4028.78it/s]
Class Images Targets P R mAP@.5 mAP@.5:.95: 25%|▎| 1/4 [01:09<03:28, 69.61s/iKilled
No results! Why is that? What does "Reason: Unsupported hardware." refer to? Does this step have to be run on actual Kneron hardware?
In the end, adding --device 0 made it produce the results.
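A note for later readers, offered as a guess rather than a confirmed diagnosis: the NNPACK warning also appears in the earlier training log, which carried on past it, so it is probably not what stopped the run. A bare "Killed" on Linux usually means the process was terminated from outside, and running out of RAM is a common cause, which could fit a CPU run with batch_size=32. The sketch below reads /proc/meminfo-style text to show how much memory is available before starting; the function names are hypothetical, and nothing in this thread confirms that memory was the actual cause.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}.

    Most fields are reported in kB; the unit suffix is dropped.
    """
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[key.strip()] = int(parts[0])
    return values


def available_gib(text):
    """Return the MemAvailable field converted from kB to GiB."""
    return parse_meminfo(text)["MemAvailable"] / (1024 ** 2)
```

On a Linux host this can be used as available_gib(open("/proc/meminfo").read()). If the figure is small, a lower --batch-size or running on the GPU with --device 0, as above, are the obvious things to try.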