勉強しないとな~blog

I'm an electrical design engineer who really ought to study properly...

Working with the OAK-D S2 - Hand Detection Sample 2

A continuation of the previous post.

nokixa.hatenablog.com

What I did

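The hand_tracker sample from last time can apparently take a video file as input instead of an OAK-D device, so I tried that.
It uses MediaPipe internally, so simply using MediaPipe directly might be fine too, but this repository also provides BPF (Body Pre Focusing), which might give better results.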

I used the same piano-playing video as before.

Environment

Same GitHub repository as last time.

github.com

I reuse the copy I cloned last time with the command below.

git clone https://github.com/geaxgx/depthai_hand_tracker

Running it

python demo.py --input "../../data/2023-08-31 07.48.56.mp4" --output "../../data/2023-08-31 07.48.56_depthai_1.avi"
Palm detection blob : C:\work\oak-d_test\depthai_hand_tracker\models\palm_detection_sh4.blob
Landmark blob       : C:\work\oak-d_test\depthai_hand_tracker\models\hand_landmark_lite_sh4.blob
Video FPS: 15
Original frame size: 2304x1296
Padding on height : 504
Frame working size: 2304x1296
896 anchors have been created
Creating pipeline...
Creating Palm Detection Neural Network...
Creating Hand Landmark Neural Network (2 threads)...
Pipeline created.
[19443010819FF41200] [2.6] [1.075] [NeuralNetwork(0)] [warning] Network compiled for 4 shaves, maximum available 16, compiling for 8 shaves likely will yield in better performance
Pipeline started - USB speed: SUPER
[19443010819FF41200] [2.6] [1.076] [NeuralNetwork(3)] [warning] Network compiled for 4 shaves, maximum available 16, compiling for 8 shaves likely will yield in better performance
[19443010819FF41200] [2.6] [1.082] [NeuralNetwork(0)] [warning] The issued warnings are orientative, based on optimal settings for a single network, if multiple networks are running in parallel the optimal settings may vary
[19443010819FF41200] [2.6] [1.082] [NeuralNetwork(3)] [warning] The issued warnings are orientative, based on optimal settings for a single network, if multiple networks are running in parallel the optimal settings may vary
FPS : 15.8 f/s (# frames = 458)
# frames w/ no hand           : 1 (0.2%)
# frames w/ palm detection    : 47 (10.3%)
# frames w/ landmark inference : 456 (99.6%)- # after palm detection: 46 - # after landmarks ROI prediction: 410
On frames with at least one landmark inference, average number of landmarks inferences/frame: 1.34
# lm inferences: 610 - # failed lm inferences: 97 (15.9%)
Palm detection round trip            : 27.7 ms
Hand landmark round trip             : 18.7 ms
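
The `Padding on height : 504` line is consistent with the 2304x1296 frame being padded to a square. A minimal sketch of that arithmetic, assuming the padding is split evenly between the two opposite edges (the function name is mine, not something from the repository):

```python
from typing import Tuple


def square_padding(width: int, height: int) -> Tuple[int, int]:
    """Return (pad_w, pad_h): pixels added on each side to make the frame square."""
    side = max(width, height)
    # Assumption: the padding is split evenly between the two opposite edges.
    pad_w = (side - width) // 2
    pad_h = (side - height) // 2
    return pad_w, pad_h


# Frame size reported in the log above: 2304x1296
print(square_padding(2304, 1296))  # (0, 504)
```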

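The OAK-D camera is not used as the input this time, but the script failed with an error unless it was connected.
So I ran everything with the camera plugged in.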

The result looks about the same as when I used MediaPipe before.
Which is to be expected, I suppose?
Almost only the left hand is recognized.

Changing the model

The --lm_model option apparently lets you change which landmark model is used, so I tried the full model.

python demo.py --input "../../data/2023-08-31 07.48.56.mp4" --output "../../data/2023-08-31 07.48.56_depthai_2.avi" --lm_model full
Palm detection blob : C:\work\oak-d_test\depthai_hand_tracker\models\palm_detection_sh4.blob
Landmark blob       : C:\work\oak-d_test\depthai_hand_tracker\models\hand_landmark_full_sh4.blob
Video FPS: 15
Original frame size: 2304x1296
Padding on height : 504
Frame working size: 2304x1296
896 anchors have been created
Creating pipeline...
Creating Palm Detection Neural Network...
Creating Hand Landmark Neural Network (2 threads)...
Pipeline created.
[19443010819FF41200] [2.6] [1.127] [NeuralNetwork(0)] [warning] Network compiled for 4 shaves, maximum available 16, compiling for 8 shaves likely will yield in better performance
[19443010819FF41200] [2.6] [1.128] [NeuralNetwork(3)] [warning] Network compiled for 4 shaves, maximum available 16, compiling for 8 shaves likely will yield in better performance
Pipeline started - USB speed: SUPER
[19443010819FF41200] [2.6] [1.136] [NeuralNetwork(0)] [warning] The issued warnings are orientative, based on optimal settings for a single network, if multiple networks are running in parallel the optimal settings may vary
[19443010819FF41200] [2.6] [1.136] [NeuralNetwork(3)] [warning] The issued warnings are orientative, based on optimal settings for a single network, if multiple networks are running in parallel the optimal settings may vary
FPS : 13.8 f/s (# frames = 458)
# frames w/ no hand           : 1 (0.2%)
# frames w/ palm detection    : 47 (10.3%)
# frames w/ landmark inference : 456 (99.6%)- # after palm detection: 46 - # after landmarks ROI prediction: 410
On frames with at least one landmark inference, average number of landmarks inferences/frame: 1.29
# lm inferences: 588 - # failed lm inferences: 69 (11.7%)
Palm detection round trip            : 27.5 ms
Hand landmark round trip             : 28.8 ms

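Not much different, or maybe slightly worse...
It is not visible here because I cropped the same range as in the first video, but outside the cropped range both hands were recognized more often.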

Using BPF

Next, I ran the BPF-enabled code.

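There are several options:

  • --body_pre_focusing has four settings, but this time I want to detect both hands (Duo mode, which is the default when the -s option is not given), and in that case it is forced to group, so I do not set it
    • right: detect only the right hand
    • left: detect only the left hand
    • higher: detect only a raised hand (based on its position relative to the elbow)
    • group: extract an area that contains both the right and left hands
  • --all_hands: when not set, only raised hands are considered; when set, hands are considered under all conditions, so I set it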
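
To make the "raised hand" condition and the group area concrete, here is a minimal sketch of the two ideas as I understand them. It is not the repository's code: the `Keypoint` type, the function names, the sample coordinates, and the margin are all hypothetical, and image coordinates are assumed to have y growing downward.

```python
from typing import NamedTuple, Tuple


class Keypoint(NamedTuple):
    """Hypothetical body keypoint in pixel coordinates."""
    x: float
    y: float


def hand_is_up(wrist: Keypoint, elbow: Keypoint) -> bool:
    # y grows downward in image coordinates, so "above" means a smaller y.
    return wrist.y < elbow.y


def group_zone(left_wrist: Keypoint, right_wrist: Keypoint,
               margin: float) -> Tuple[float, float, float, float]:
    """Box (x_min, y_min, x_max, y_max) containing both wrists plus a margin."""
    x_min = min(left_wrist.x, right_wrist.x) - margin
    y_min = min(left_wrist.y, right_wrist.y) - margin
    x_max = max(left_wrist.x, right_wrist.x) + margin
    y_max = max(left_wrist.y, right_wrist.y) + margin
    return x_min, y_min, x_max, y_max


# Made-up keypoints: the wrist is above the elbow, so the hand counts as raised.
print(hand_is_up(Keypoint(100, 200), Keypoint(120, 300)))  # True
print(group_zone(Keypoint(400, 600), Keypoint(900, 650), margin=100))  # (300, 500, 1000, 750)
```

For someone playing the piano, the wrists would usually not be above the elbows, which is why considering all hands matters for this video.
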
python demo_bpf.py --input "../../data/2023-08-31 07.48.56.mp4" --output "../../data/2023-08-31 07.48.56_depthai_3.avi" --all_hands
Palm detection blob : C:\work\oak-d_test\depthai_hand_tracker\models\palm_detection_sh4.blob
Landmark blob       : C:\work\oak-d_test\depthai_hand_tracker\models\hand_landmark_lite_sh4.blob
In Duo mode, body_pre_focusing is forced to 'group'
Body pose blob      : C:\work\oak-d_test\depthai_hand_tracker\models\movenet_singlepose_thunder_U8_transpose.blob
Video FPS: 15
Original frame size: 2304x1296
Padding on height : 504
Frame working size: 2304x1296
896 anchors have been created
Creating pipeline...
Creating Body Pose Neural Network...
Creating Palm Detection Neural Network...
Creating Hand Landmark Neural Network...
Pipeline created.
[19443010819FF41200] [2.6] [1.248] [NeuralNetwork(3)] [warning] Network compiled for 4 shaves, maximum available 16, compiling for 8 shaves likely will yield in better performance
Pipeline started - USB speed: SUPER
[19443010819FF41200] [2.6] [1.249] [NeuralNetwork(0)] [warning] Network compiled for 4 shaves, maximum available 16, compiling for 8 shaves likely will yield in better performance
[19443010819FF41200] [2.6] [1.249] [NeuralNetwork(6)] [warning] Network compiled for 4 shaves, maximum available 16, compiling for 8 shaves likely will yield in better performance
[19443010819FF41200] [2.6] [1.257] [NeuralNetwork(3)] [warning] The issued warnings are orientative, based on optimal settings for a single network, if multiple networks are running in parallel the optimal settings may vary
[19443010819FF41200] [2.6] [1.257] [NeuralNetwork(0)] [warning] The issued warnings are orientative, based on optimal settings for a single network, if multiple networks are running in parallel the optimal settings may vary
[19443010819FF41200] [2.6] [1.257] [NeuralNetwork(6)] [warning] The issued warnings are orientative, based on optimal settings for a single network, if multiple networks are running in parallel the optimal settings may vary
FPS : 15.1 f/s (# frames = 458)
# frames w/ no hand           : 0 (0.0%)
# frames w/ palm detection    : 35 (7.6%)
# frames w/ landmark inference : 457 (99.8%)- # after palm detection: 35 - # after landmarks ROI prediction: 422
On frames with at least one landmark inference, average number of landmarks inferences/frame: 1.20
# lm inferences: 549 - # failed lm inferences: 4 (0.7%)
Body pose estimation round trip      : 94.7 ms
Palm detection round trip            : 25.5 ms
Hand landmark round trip             : 24.5 ms

The result: not much different.
The moments at which both hands are recognized changed slightly.
(Outside the range of the video I posted.)

Changing the model & using BPF

python demo_bpf.py --input "../../data/2023-08-31 07.48.56.mp4" --output "../../data/2023-08-31 07.48.56_depthai_4.avi" --lm_model full --all_hands
Palm detection blob : C:\work\oak-d_test\depthai_hand_tracker\models\palm_detection_sh4.blob
Landmark blob       : C:\work\oak-d_test\depthai_hand_tracker\models\hand_landmark_full_sh4.blob
In Duo mode, body_pre_focusing is forced to 'group'
Body pose blob      : C:\work\oak-d_test\depthai_hand_tracker\models\movenet_singlepose_thunder_U8_transpose.blob
Video FPS: 15
Original frame size: 2304x1296
Padding on height : 504
Frame working size: 2304x1296
896 anchors have been created
Creating pipeline...
Creating Body Pose Neural Network...
Creating Palm Detection Neural Network...
Creating Hand Landmark Neural Network...
Pipeline created.
[19443010819FF41200] [2.6] [1.386] [NeuralNetwork(3)] [warning] Network compiled for 4 shaves, maximum available 16, compiling for 8 shaves likely will yield in better performance
Pipeline started - USB speed: SUPER
[19443010819FF41200] [2.6] [1.386] [NeuralNetwork(0)] [warning] Network compiled for 4 shaves, maximum available 16, compiling for 8 shaves likely will yield in better performance
[19443010819FF41200] [2.6] [1.387] [NeuralNetwork(6)] [warning] Network compiled for 4 shaves, maximum available 16, compiling for 8 shaves likely will yield in better performance
[19443010819FF41200] [2.6] [1.395] [NeuralNetwork(3)] [warning] The issued warnings are orientative, based on optimal settings for a single network, if multiple networks are running in parallel the optimal settings may vary
[19443010819FF41200] [2.6] [1.395] [NeuralNetwork(0)] [warning] The issued warnings are orientative, based on optimal settings for a single network, if multiple networks are running in parallel the optimal settings may vary
[19443010819FF41200] [2.6] [1.395] [NeuralNetwork(6)] [warning] The issued warnings are orientative, based on optimal settings for a single network, if multiple networks are running in parallel the optimal settings may vary
FPS : 12.7 f/s (# frames = 458)
# frames w/ no hand           : 0 (0.0%)
# frames w/ palm detection    : 34 (7.4%)
# frames w/ landmark inference : 457 (99.8%)- # after palm detection: 34 - # after landmarks ROI prediction: 423
On frames with at least one landmark inference, average number of landmarks inferences/frame: 1.18
# lm inferences: 538 - # failed lm inferences: 3 (0.6%)
Body pose estimation round trip      : 97.9 ms
Palm detection round trip            : 25.8 ms
Hand landmark round trip             : 36.7 ms

For some reason ScreenToGif could not load the video file, so there is no GIF, but there was no big change.
That is judging by eye, though.
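
Since that is only a visual impression, the summary lines printed by the demos can also be compared directly. A minimal sketch that pulls the FPS and landmark-inference counts out of the four logs above; the regular expressions and run labels are my own, written against the log format shown in this post, and each log is abbreviated to the two relevant lines.

```python
import re

# Relevant summary lines copied from the four runs above.
RUNS = {
    "lite": "FPS : 15.8 f/s (# frames = 458)\n"
            "# lm inferences: 610 - # failed lm inferences: 97 (15.9%)",
    "full": "FPS : 13.8 f/s (# frames = 458)\n"
            "# lm inferences: 588 - # failed lm inferences: 69 (11.7%)",
    "lite + BPF": "FPS : 15.1 f/s (# frames = 458)\n"
                  "# lm inferences: 549 - # failed lm inferences: 4 (0.7%)",
    "full + BPF": "FPS : 12.7 f/s (# frames = 458)\n"
                  "# lm inferences: 538 - # failed lm inferences: 3 (0.6%)",
}

FPS_RE = re.compile(r"FPS : ([\d.]+) f/s")
LM_RE = re.compile(r"# lm inferences: (\d+) - # failed lm inferences: (\d+)")


def summarize(log_text: str) -> dict:
    """Extract FPS and landmark-inference failure statistics from one log."""
    fps = float(FPS_RE.search(log_text).group(1))
    total, failed = map(int, LM_RE.search(log_text).groups())
    return {"fps": fps, "lm_total": total, "lm_failed": failed,
            "fail_rate": failed / total}


for name, text in RUNS.items():
    s = summarize(text)
    print(f"{name:10s} fps={s['fps']:5.1f} "
          f"failed={s['lm_failed']:3d}/{s['lm_total']} ({s['fail_rate']:.1%})")
```

Read this way, the logs do show a difference even though the videos look alike: with BPF the share of failed landmark inferences drops from 15.9% to 0.7% (lite) and from 11.7% to 0.6% (full), while the full model costs roughly 2 f/s with or without BPF.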

That's it for now

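In the piano video, the keys on the right-hand side are blown out to pure white by reflected sunlight, so that may be a bigger problem than how large the hands appear in the frame.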
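
One way to check this guess would be to measure how much of the right side of a frame is near-white. A minimal sketch in plain Python on a grayscale image given as rows of 8-bit values; the threshold of 250, the choice of "right half of the frame" as the region, and the tiny sample image are all assumptions for illustration. Real frames could be read with OpenCV's `cv2.VideoCapture` and converted with `cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)` before being passed in.

```python
from typing import List, Sequence


def saturated_fraction(gray: Sequence[Sequence[int]], threshold: int = 250) -> float:
    """Fraction of pixels whose 8-bit grayscale value is at or above threshold."""
    total = sum(len(row) for row in gray)
    if total == 0:
        return 0.0
    bright = sum(1 for row in gray for value in row if value >= threshold)
    return bright / total


def right_half(gray: Sequence[Sequence[int]]) -> List[Sequence[int]]:
    """Keep only the right half of every row."""
    return [row[len(row) // 2:] for row in gray]


# Tiny made-up 2x4 "image": the right half is mostly blown out.
demo = [
    [40, 80, 255, 255],
    [35, 90, 252, 120],
]
print(saturated_fraction(right_half(demo)))  # 0.75
```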