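This is a continuation of the previous post.
What I did
The hand_tracker sample from last time can reportedly take a video file as input instead of the OAK-D device, so I tried that.
It uses MediaPipe internally, so using MediaPipe directly might be just as good, but this project also provides BPF (Body Pre Focusing), which might improve the results.
I used the same piano-playing video as before.
Environment
I reused the repository exactly as I cloned it last time: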
git clone https://github.com/geaxgx/depthai_hand_tracker
Running it
python demo.py --input "../../data/2023-08-31 07.48.56.mp4" --output "../../data/2023-08-31 07.48.56_depthai_1.avi"
Palm detection blob : C:\work\oak-d_test\depthai_hand_tracker\models\palm_detection_sh4.blob
Landmark blob : C:\work\oak-d_test\depthai_hand_tracker\models\hand_landmark_lite_sh4.blob
Video FPS: 15
Original frame size: 2304x1296
Padding on height : 504
Frame working size: 2304x1296
896 anchors have been created
Creating pipeline...
Creating Palm Detection Neural Network...
Creating Hand Landmark Neural Network (2 threads)...
Pipeline created.
[19443010819FF41200] [2.6] [1.075] [NeuralNetwork(0)] [warning] Network compiled for 4 shaves, maximum available 16, compiling for 8 shaves likely will yield in better performance
Pipeline started - USB speed: SUPER
[19443010819FF41200] [2.6] [1.076] [NeuralNetwork(3)] [warning] Network compiled for 4 shaves, maximum available 16, compiling for 8 shaves likely will yield in better performance
[19443010819FF41200] [2.6] [1.082] [NeuralNetwork(0)] [warning] The issued warnings are orientative, based on optimal settings for a single network, if multiple networks are running in parallel the optimal settings may vary
[19443010819FF41200] [2.6] [1.082] [NeuralNetwork(3)] [warning] The issued warnings are orientative, based on optimal settings for a single network, if multiple networks are running in parallel the optimal settings may vary
FPS : 15.8 f/s (# frames = 458)
# frames w/ no hand : 1 (0.2%)
# frames w/ palm detection : 47 (10.3%)
# frames w/ landmark inference : 456 (99.6%)- # after palm detection: 46 - # after landmarks ROI prediction: 410
On frames with at least one landmark inference, average number of landmarks inferences/frame: 1.34
# lm inferences: 610 - # failed lm inferences: 97 (15.9%)
Palm detection round trip : 27.7 ms
Hand landmark round trip : 18.7 ms
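A side note on the log above: the "Padding on height : 504" value is consistent with padding the 2304x1296 frame to a square before it is fed to the palm detector. This is only a reading of the numbers, not something confirmed from the script's code, and `square_padding` is a hypothetical helper written for this check.

```python
def square_padding(width: int, height: int) -> tuple:
    """Padding (per side) needed to make a width x height frame square.

    Hypothetical helper for illustration; not taken from depthai_hand_tracker.
    """
    side = max(width, height)
    pad_w = (side - width) // 2   # added on the left and on the right
    pad_h = (side - height) // 2  # added on the top and on the bottom
    return pad_w, pad_h


# Frame size reported in the log: 2304x1296
print(square_padding(2304, 1296))  # (0, 504)
```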
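The OAK-D camera is not needed as a video source this time, but the script raised an error unless it was connected, so I ran everything with it plugged in. The log suggests the neural networks still run on the device, which would explain this.
The result looks about the same as when I used MediaPipe before.
I suppose that is to be expected.
Almost only the left hand is recognized.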
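To put a number on that impression, the summary lines printed at the end of the run can be parsed. The following is a minimal sketch, assuming the log text has been saved as a string in the form shown above; `parse_summary` and its key names are hypothetical, not part of the demo.

```python
import re

# Summary lines copied from the first run's log.
LOG = """\
FPS : 15.8 f/s (# frames = 458)
# frames w/ no hand : 1 (0.2%)
# lm inferences: 610 - # failed lm inferences: 97 (15.9%)
Palm detection round trip : 27.7 ms
Hand landmark round trip : 18.7 ms
"""

# Hypothetical key names mapped to one regex each; \s* tolerates spacing changes.
PATTERNS = {
    "fps": r"^FPS\s*:\s*([\d.]+) f/s",
    "frames": r"# frames = (\d+)\)",
    "lm_inferences": r"# lm inferences:\s*(\d+)",
    "lm_failed": r"# failed lm inferences:\s*(\d+)",
    "palm_ms": r"Palm detection round trip\s*:\s*([\d.]+) ms",
    "landmark_ms": r"Hand landmark round trip\s*:\s*([\d.]+) ms",
}


def parse_summary(log: str) -> dict:
    """Extract a few headline numbers from the end-of-run summary."""
    stats = {}
    for key, pattern in PATTERNS.items():
        match = re.search(pattern, log, flags=re.MULTILINE)
        if match:  # keys whose line is missing are simply left out
            stats[key] = float(match.group(1))
    return stats


stats = parse_summary(LOG)
failure_rate = stats["lm_failed"] / stats["lm_inferences"]
print(f"failed landmark inferences: {failure_rate:.1%}")
```

Running the same parser on each run's log makes it easy to compare runs without relying on the video alone.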
Changing the model
The --lm_model option selects which landmark model to use, so I tried the full model.
python demo.py --input "../../data/2023-08-31 07.48.56.mp4" --output "../../data/2023-08-31 07.48.56_depthai_2.avi" --lm_model full
Palm detection blob : C:\work\oak-d_test\depthai_hand_tracker\models\palm_detection_sh4.blob
Landmark blob : C:\work\oak-d_test\depthai_hand_tracker\models\hand_landmark_full_sh4.blob
Video FPS: 15
Original frame size: 2304x1296
Padding on height : 504
Frame working size: 2304x1296
896 anchors have been created
Creating pipeline...
Creating Palm Detection Neural Network...
Creating Hand Landmark Neural Network (2 threads)...
Pipeline created.
[19443010819FF41200] [2.6] [1.127] [NeuralNetwork(0)] [warning] Network compiled for 4 shaves, maximum available 16, compiling for 8 shaves likely will yield in better performance
[19443010819FF41200] [2.6] [1.128] [NeuralNetwork(3)] [warning] Network compiled for 4 shaves, maximum available 16, compiling for 8 shaves likely will yield in better performance
Pipeline started - USB speed: SUPER
[19443010819FF41200] [2.6] [1.136] [NeuralNetwork(0)] [warning] The issued warnings are orientative, based on optimal settings for a single network, if multiple networks are running in parallel the optimal settings may vary
[19443010819FF41200] [2.6] [1.136] [NeuralNetwork(3)] [warning] The issued warnings are orientative, based on optimal settings for a single network, if multiple networks are running in parallel the optimal settings may vary
FPS : 13.8 f/s (# frames = 458)
# frames w/ no hand : 1 (0.2%)
# frames w/ palm detection : 47 (10.3%)
# frames w/ landmark inference : 456 (99.6%)- # after palm detection: 46 - # after landmarks ROI prediction: 410
On frames with at least one landmark inference, average number of landmarks inferences/frame: 1.29
# lm inferences: 588 - # failed lm inferences: 69 (11.7%)
Palm detection round trip : 27.5 ms
Hand landmark round trip : 28.8 ms
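The result is not noticeably different, or perhaps slightly worse...
This is not visible here because I cropped the same range as in the first clip, but outside that range both hands were recognized more often.
Using BPF
Next, I ran the BPF version of the demo.
It has several options:
--body_pre_focusing takes one of four modes. This time I want to detect both hands (Duo mode, the default when the -s option is not given), which forces the mode to group, so I left it unset.
  right: detect the right hand only
  left: detect the left hand only
  higher: detect a hand only when it is raised (judged from its position relative to the elbow)
  group: extract the area containing both the right and left hands
--all_hands: when not set, only raised hands are considered; when set, hands are considered in all conditions, so I set it.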
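The interaction between the group mode and --all_hands described above can be sketched as follows. This is an illustration of the idea only, not the repository's implementation: `is_hand_up`, `focus_zone`, the margin value, and the keypoint coordinates are all hypothetical, and it assumes "raised" means the wrist is higher in the image than the elbow.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # (x, y) in pixels; y grows downward in images
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def is_hand_up(wrist: Point, elbow: Point) -> bool:
    """Assumed rule: a hand is 'up' when the wrist is above the elbow."""
    return wrist[1] < elbow[1]


def focus_zone(arms: List[Tuple[Point, Point]], all_hands: bool,
               margin: float) -> Optional[Box]:
    """Box enclosing every considered wrist, or None if no hand qualifies.

    `arms` holds (wrist, elbow) pairs from a body pose estimate.
    """
    wrists = [wrist for wrist, elbow in arms
              if all_hands or is_hand_up(wrist, elbow)]
    if not wrists:
        return None
    xs = [x for x, _ in wrists]
    ys = [y for _, y in wrists]
    return (min(xs) - margin, min(ys) - margin,
            max(xs) + margin, max(ys) + margin)


# Made-up keypoints for a pianist: both wrists sit below the elbows.
arms = [((800, 900), (700, 700)), ((1500, 900), (1600, 700))]
print(focus_zone(arms, all_hands=False, margin=100))  # None
print(focus_zone(arms, all_hands=True, margin=100))   # (700, 800, 1600, 1000)
```

Under these assumptions, a pianist's hands would usually not count as raised, which is why setting --all_hands makes sense for this video.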
python demo_bpf.py --input "../../data/2023-08-31 07.48.56.mp4" --output "../../data/2023-08-31 07.48.56_depthai_3.avi" --all_hands
Palm detection blob : C:\work\oak-d_test\depthai_hand_tracker\models\palm_detection_sh4.blob
Landmark blob : C:\work\oak-d_test\depthai_hand_tracker\models\hand_landmark_lite_sh4.blob
In Duo mode, body_pre_focusing is forced to 'group'
Body pose blob : C:\work\oak-d_test\depthai_hand_tracker\models\movenet_singlepose_thunder_U8_transpose.blob
Video FPS: 15
Original frame size: 2304x1296
Padding on height : 504
Frame working size: 2304x1296
896 anchors have been created
Creating pipeline...
Creating Body Pose Neural Network...
Creating Palm Detection Neural Network...
Creating Hand Landmark Neural Network...
Pipeline created.
[19443010819FF41200] [2.6] [1.248] [NeuralNetwork(3)] [warning] Network compiled for 4 shaves, maximum available 16, compiling for 8 shaves likely will yield in better performance
Pipeline started - USB speed: SUPER
[19443010819FF41200] [2.6] [1.249] [NeuralNetwork(0)] [warning] Network compiled for 4 shaves, maximum available 16, compiling for 8 shaves likely will yield in better performance
[19443010819FF41200] [2.6] [1.249] [NeuralNetwork(6)] [warning] Network compiled for 4 shaves, maximum available 16, compiling for 8 shaves likely will yield in better performance
[19443010819FF41200] [2.6] [1.257] [NeuralNetwork(3)] [warning] The issued warnings are orientative, based on optimal settings for a single network, if multiple networks are running in parallel the optimal settings may vary
[19443010819FF41200] [2.6] [1.257] [NeuralNetwork(0)] [warning] The issued warnings are orientative, based on optimal settings for a single network, if multiple networks are running in parallel the optimal settings may vary
[19443010819FF41200] [2.6] [1.257] [NeuralNetwork(6)] [warning] The issued warnings are orientative, based on optimal settings for a single network, if multiple networks are running in parallel the optimal settings may vary
FPS : 15.1 f/s (# frames = 458)
# frames w/ no hand : 0 (0.0%)
# frames w/ palm detection : 35 (7.6%)
# frames w/ landmark inference : 457 (99.8%)- # after palm detection: 35 - # after landmarks ROI prediction: 422
On frames with at least one landmark inference, average number of landmarks inferences/frame: 1.20
# lm inferences: 549 - # failed lm inferences: 4 (0.7%)
Body pose estimation round trip : 94.7 ms
Palm detection round trip : 25.5 ms
Hand landmark round trip : 24.5 ms
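The result did not change much.
The moments at which both hands are recognized shifted slightly.
(This happens outside the range of the clip I posted.)
Changing the model & using BPF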
python demo_bpf.py --input "../../data/2023-08-31 07.48.56.mp4" --output "../../data/2023-08-31 07.48.56_depthai_4.avi" --lm_model full --all_hands
Palm detection blob : C:\work\oak-d_test\depthai_hand_tracker\models\palm_detection_sh4.blob
Landmark blob : C:\work\oak-d_test\depthai_hand_tracker\models\hand_landmark_full_sh4.blob
In Duo mode, body_pre_focusing is forced to 'group'
Body pose blob : C:\work\oak-d_test\depthai_hand_tracker\models\movenet_singlepose_thunder_U8_transpose.blob
Video FPS: 15
Original frame size: 2304x1296
Padding on height : 504
Frame working size: 2304x1296
896 anchors have been created
Creating pipeline...
Creating Body Pose Neural Network...
Creating Palm Detection Neural Network...
Creating Hand Landmark Neural Network...
Pipeline created.
[19443010819FF41200] [2.6] [1.386] [NeuralNetwork(3)] [warning] Network compiled for 4 shaves, maximum available 16, compiling for 8 shaves likely will yield in better performance
Pipeline started - USB speed: SUPER
[19443010819FF41200] [2.6] [1.386] [NeuralNetwork(0)] [warning] Network compiled for 4 shaves, maximum available 16, compiling for 8 shaves likely will yield in better performance
[19443010819FF41200] [2.6] [1.387] [NeuralNetwork(6)] [warning] Network compiled for 4 shaves, maximum available 16, compiling for 8 shaves likely will yield in better performance
[19443010819FF41200] [2.6] [1.395] [NeuralNetwork(3)] [warning] The issued warnings are orientative, based on optimal settings for a single network, if multiple networks are running in parallel the optimal settings may vary
[19443010819FF41200] [2.6] [1.395] [NeuralNetwork(0)] [warning] The issued warnings are orientative, based on optimal settings for a single network, if multiple networks are running in parallel the optimal settings may vary
[19443010819FF41200] [2.6] [1.395] [NeuralNetwork(6)] [warning] The issued warnings are orientative, based on optimal settings for a single network, if multiple networks are running in parallel the optimal settings may vary
FPS : 12.7 f/s (# frames = 458)
# frames w/ no hand : 0 (0.0%)
# frames w/ palm detection : 34 (7.4%)
# frames w/ landmark inference : 457 (99.8%)- # after palm detection: 34 - # after landmarks ROI prediction: 423
On frames with at least one landmark inference, average number of landmarks inferences/frame: 1.18
# lm inferences: 538 - # failed lm inferences: 3 (0.6%)
Body pose estimation round trip : 97.9 ms
Palm detection round trip : 25.8 ms
Hand landmark round trip : 36.7 ms
For some reason ScreenToGif could not load this video file, so there is no GIF, but there was no major change.
That is only judging by eye, though.
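Since the comparison above is visual only, the numbers already printed in the four logs can be lined up side by side. The sketch below copies those values by hand; the dictionary layout and the `failure_rate` and `format_table` helpers are hypothetical.

```python
# Numbers copied from the end-of-run summaries of the four runs in this post.
RUNS = {
    "lite": {"fps": 15.8, "lm_inferences": 610, "lm_failed": 97,
             "landmark_ms": 18.7},
    "full": {"fps": 13.8, "lm_inferences": 588, "lm_failed": 69,
             "landmark_ms": 28.8},
    "lite + BPF": {"fps": 15.1, "lm_inferences": 549, "lm_failed": 4,
                   "landmark_ms": 24.5},
    "full + BPF": {"fps": 12.7, "lm_inferences": 538, "lm_failed": 3,
                   "landmark_ms": 36.7},
}


def failure_rate(run: dict) -> float:
    """Share of landmark inferences that the demo reported as failed."""
    return run["lm_failed"] / run["lm_inferences"]


def format_table(runs: dict) -> str:
    """Render one row per run: FPS, failed share, landmark round trip."""
    lines = [f"{'run':<12}{'FPS':>6}{'failed':>9}{'lm ms':>8}"]
    for name, run in runs.items():
        lines.append(
            f"{name:<12}{run['fps']:>6.1f}"
            f"{failure_rate(run):>9.1%}{run['landmark_ms']:>8.1f}"
        )
    return "\n".join(lines)


print(format_table(RUNS))
```

In the logged numbers, the share of failed landmark inferences drops from 15.9% (lite) and 11.7% (full) to under 1% with BPF, while the full model lowers the FPS in both cases. What exactly counts as a "failed" inference is defined by the demo script, so this is a hint rather than a measure of tracking quality.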
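That's all for now
In the piano video, the keys on the right-hand side are blown out to pure white by reflected sunlight, so that may be a bigger problem than how large the hands appear in the frame.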
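One way to check the overexposure suspicion would be to measure how many pixels are saturated in each half of a frame. The sketch below is a minimal version using plain Python lists for an 8-bit grayscale image; the function names, the threshold of 250, and the tiny synthetic frame are assumptions for illustration, and a real frame would first have to be decoded from the video (for example with OpenCV).

```python
from typing import List, Tuple

Gray = List[List[int]]  # rows of 8-bit grayscale values (0 to 255)


def blown_out_fraction(gray: Gray, threshold: int = 250) -> float:
    """Fraction of pixels at or above `threshold` (assumed saturation level)."""
    total = sum(len(row) for row in gray)
    if total == 0:
        return 0.0
    saturated = sum(1 for row in gray for value in row if value >= threshold)
    return saturated / total


def split_halves(gray: Gray) -> Tuple[Gray, Gray]:
    """Split every row in the middle into a left and a right image."""
    left = [row[: len(row) // 2] for row in gray]
    right = [row[len(row) // 2:] for row in gray]
    return left, right


# Tiny synthetic 2x4 "frame" whose right half is almost white.
frame = [
    [120, 130, 255, 254],
    [110, 125, 251, 200],
]
left, right = split_halves(frame)
print(blown_out_fraction(left), blown_out_fraction(right))  # 0.0 0.75
```

If one half of the real frames shows a much higher fraction than the other, exposure would be a plausible reason why that hand is rarely recognized. Which half corresponds to the right-hand keys depends on the camera angle.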