Hello, I am Masakazu Iwamura from Osaka Prefecture University, Japan. The title of the talk is "Suitable Camera and Rotation Navigation for People with Visual Impairment on Looking for Something Using Object Detection Technique." This is joint work with Yoshihiko Inoue, Kazunori Minatani, and Koichi Kise.

One of the difficulties for people with visual impairment is that they have limited access to visual information. For example, this person wants to drink tea but doesn't know which bottle it is. In such a case, the person can use smartphone apps that can tell what it is, like this. Apps that tell visual information can be categorized into two groups. Those in the first category ask remote sighted people for help; they include VizWiz and Be My Eyes. Those in the second category use computer vision techniques; they include Envision AI, TapTapSee, and Seeing AI. In this paper, we focus on the latter.

While these apps are powerful and useful, they have a limitation: they assume the object is in front of the person. For example, in this case, the person wants to recognize tea, but the bottle of tea is actually not in front of the person. In such a case, the person needs to find the bottle first, but the smartphone apps do not tell where it is.

Let us summarize the situation. If what is unknown and where is known, which we name Category (i), a representative task is obtaining visual information about the object that the user photographs; this can be realized by current smartphone apps. If what is known and where is unknown, which we name Category (ii), a representative task is looking for something, and a camera with a wide field of view, such as an omnidirectional camera, would be useful. For those not familiar with omnidirectional cameras: this is the device, which usually has multiple cameras, and this is an image taken by an omnidirectional camera. By extending the idea, we can also consider Category (iii), in which both what and where are unknown. A representative task is finding something valuable and unexpected to the user. In this case, we expect more visual information to be required, but the information provided to the user should be carefully selected.
In this paper, we focus on Category (ii), specifically the task of looking for something. Here is an overview of the research. For the task of looking for something, we implement a computer-vision-based prototype system that guides the user to the target object. We investigate whether an omnidirectional camera is more suitable than a normal camera, and we compare five rotation navigation methods, including three novel voice navigation methods.

Here is the agenda. We have completed the introduction and now move on to the method. Let us present the prototype system, which consists of a laptop computer and a camera system. The procedure of the prototype system is threefold; see the sketch after this paragraph. Suppose the person wants to look for his backpack. In the first step, the prototype system finds a backpack. In the second step, on the spot, the system navigates the user to rotate so that the user is heading toward the backpack. In the third step, the system lets the user advance toward the target object and stop in front of it. We mainly focus on Step 2.
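To make this three-step flow concrete, here is a minimal sketch of a detect-rotate-advance loop. This is not the paper's actual implementation: the four callables (detect_bearing, read_heading, read_distance, speak) are hypothetical stand-ins for the object detector, the electronic compass, the depth camera, and a text-to-speech engine, and the tolerance and stop distance are assumed values.

```python
import time

def angular_error(target_bearing_deg: float, heading_deg: float) -> float:
    """Signed angle from the user's heading to the target, in (-180, 180]."""
    return (target_bearing_deg - heading_deg + 180.0) % 360.0 - 180.0

def guide_to_object(detect_bearing, read_heading, read_distance, speak,
                    facing_tolerance_deg: float = 15.0,
                    stop_distance_m: float = 0.5) -> None:
    """Detect-rotate-advance loop; all four callables are placeholders."""
    # Step 1: find the target object in the (omnidirectional) image.
    bearing = detect_bearing()  # absolute bearing in degrees, or None if not found
    if bearing is None:
        speak("Not found.")
        return

    # Step 2: rotation navigation until the user faces the target.
    while True:
        err = angular_error(bearing, read_heading())
        if abs(err) <= facing_tolerance_deg:
            speak("In front of you.")
            break
        speak(f"{abs(round(err))} degrees {'right' if err > 0 else 'left'}")
        time.sleep(1.0)  # let the user turn before re-announcing

    # Step 3: forward navigation; the depth camera tells the system when to stop.
    while read_distance() > stop_distance_m:
        speak("Go straight.")
        time.sleep(1.0)
    speak("Stop.")
```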
There are some existing rotation navigation methods. Ahmetovic et al. examined 1 + 3 sound navigation methods in the context of turn-by-turn navigation. Here is the list of the navigation methods they examined; the latter three are also combined with the first one, Ping. Let me briefly introduce the sound navigation methods. First of all, Ping just lets the user know when the target object is in front of the user. For example, [sound demonstration] like this. The second one is intermittent sound (IS). In this one, [sound demonstration] something like this: the interval gets smaller, and in front of the object the beep frequency is 15 Hz. That part I couldn't reproduce; at 15 Hz it should sound something like "pi pi pi pi pi pi." The third one is amplitude modulation (AM). In this one, a 440 Hz sinusoid is modulated by a sub-audio sinusoid from 1 to 15 Hz. It was also not easy for me to reproduce. The fourth one is musical scale (MS). [sound demonstration] In this one, the angle is subdivided into eight circular sectors, and every time the user enters a sector, it plays a piano sound. According to their experiment, they concluded that IS and MS combined with Ping were the best.

Here is the list of rotation navigation methods we examined, which includes three voice navigation and two sound navigation methods. Voice navigation has not been used so far. Let me present the rotation navigation methods one by one. The first one is left or right (LR), which is very simple: it just says left or right. For example, if the person wants to look for his backpack: "right, right, right," "in front of you." Something like this. The second one is angle (AG). This method tells the angle toward the target object, like this: "60 degrees right," "15 degrees right," "in front of you." The third one is clock position (CP). It is similar to AG, but tells the direction using the clock position. For example: "2 o'clock," "in front of you." The fourth one is intermittent beep (IB), which is close to IS. [sound demonstration] The difference from IS is that within 15 degrees of the target object, it plays a sound like "pi pi pi pi pi pi pi." The fifth one is pitch (PT). This is [sound demonstration] something like this.
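As a rough illustration of how these methods differ, here is a sketch that maps the signed angular error to each kind of feedback. The 60-degrees/2-o'clock correspondence and the 15-degree window for IB come from the talk; the "in front of you" threshold for the voice methods and the exact beep-rate mapping are my assumptions.

```python
def left_right(err_deg: float) -> str:
    """LR: only tells the turning direction."""
    if abs(err_deg) <= 15:
        return "in front of you"
    return "right" if err_deg > 0 else "left"

def angle(err_deg: float) -> str:
    """AG: tells the remaining angle and direction, e.g., '60 degrees right'."""
    if abs(err_deg) <= 15:
        return "in front of you"
    return f"{abs(round(err_deg))} degrees {'right' if err_deg > 0 else 'left'}"

def clock_position(err_deg: float) -> str:
    """CP: tells the target's direction as a clock position, with the user's
    current heading fixed at 12 o'clock (each hour spans 30 degrees)."""
    if abs(err_deg) <= 15:
        return "in front of you"
    hour = round(err_deg / 30.0) % 12
    return f"{12 if hour == 0 else hour} o'clock"

def beep_interval(err_deg: float, max_rate_hz: float = 15.0) -> float:
    """IB: beep rate grows as the error shrinks; within 15 degrees it beeps
    at the maximum rate (15 Hz per the talk). Linear mapping is assumed."""
    if abs(err_deg) <= 15:
        return 1.0 / max_rate_hz
    rate = max(1.0, max_rate_hz * (1.0 - min(abs(err_deg), 180.0) / 180.0))
    return 1.0 / rate  # seconds between beeps
```

For example, clock_position(60.0) returns "2 o'clock", matching the AG example "60 degrees right" from the talk.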
Let's move on to the user study. The purposes of the user study are twofold. One is a comparison of the five rotation navigation methods. The other is the selection of the camera. We use two camera systems. One is the omnidirectional camera system, which uses an omnidirectional camera. The other is the pseudo smartphone: we assume that this is a smartphone, but for a fair comparison we use a web camera. In addition, also for a fair comparison, we use the same electronic compass for the rotation navigation and the same depth camera for the forward navigation. In the user study, seven people with visual impairment participated. We carried out two experiments, about the rotation navigation and about the camera. A survey, which was actually an interview, about looking for something was performed before the experiments, but because of the time constraints of the presentation, I will omit it. On the right-hand side is a snapshot from the experiment. He is the experimenter, who has a laptop computer, and he is a participant, who has the camera system. This is the target object.

This is the result of Experiment 1, about rotation navigation. In this experiment, a participant tried the five rotation navigation methods one by one and told us their preference on a 5-point scale, where 1 means the worst and 5 means the best. As you can see, no one preferred LR because it does not tell anything about the rotation angle. However, regarding the remaining four, everybody had a different preference; every method was liked by at least one person. Regarding this phenomenon, related results have been reported. In PerCom 2019, Ahmetovic et al. reported that musical experience affects the users' behavior. In Web4All 2019, Ahmetovic et al. reported that expertise affects interaction preferences in navigation assistance.

In the interview after the experiment, we found some interesting comments from the participants. Let me introduce a representative one. A participant told us his idea about the clock position navigation method. In our method, the front of the user is always 12 o'clock, and the system announces the position of the target object. However, this person wanted to fix the target object at 12 o'clock and have the system announce the direction the person is facing. This indicates that everybody has different ideas about the world and navigation.

This is the result of Experiment 2, about the camera system. As you can see, six participants out of seven preferred the omnidirectional camera.

Let me conclude the talk. In this talk, we categorized the tasks of obtaining visual information into three. We proposed a prototype system for looking for something that used an omnidirectional camera and the use of voice in rotation navigation. A user study comprising seven people with visual impairment confirmed that (1) an omnidirectional camera is preferred, and (2) users have different preferences in rotation navigation.

As for future work, we found that personalization of the prototype system is very important. For example, a participant wanted to find his or her own train pass, not a generic one. This means that we need to personalize the object detector, as sketched below. We also need to personalize the navigation method.
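The talk does not say how the detector would be personalized; one plausible approach is to take an off-the-shelf detector and fine-tune it on a few photos of the user's own items. Here is a sketch of the standard torchvision head-replacement pattern, purely as an illustration; the class "my train pass" is a hypothetical example.

```python
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

def personalized_detector(num_user_classes: int):
    """Replace the classification head of a pretrained Faster R-CNN so it can
    be fine-tuned on user-specific objects (e.g., 'my train pass')."""
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    # +1 accounts for the background class required by Faster R-CNN.
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features,
                                                      num_user_classes + 1)
    return model

# model = personalized_detector(num_user_classes=1)  # only "my train pass"
# ...then fine-tune on a small set of photos of the user's own item.
```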
From the interview, we also found that confirming the existence of the object is far more important than navigation of the direction, and also than navigation of the distance. If it takes time to detect the object, or detection fails, the user thinks the object does not exist in the room. So, we need to clarify the existence of the object.

That's it. Thank you very much for your attention.