Multi-Resolution Corrective Demonstration for Efficient Task Execution and Refinement

Çetin Meriçli · Manuela Veloso · H. Levent Akın

Received: date / Accepted: date

Abstract Computationally efficient task execution is very important for autonomous mobile robots endowed with limited on-board computational resources. Most robot control approaches assume a fixed state and action representation, and use a single algorithm to map states to actions. However, not all situations in a given task require equally complex algorithms and equally detailed state and action representations. The main motivation for this work is a desire to reduce the computational footprint of performing a task by allowing the robot to run simpler algorithms whenever possible, and to resort to a more complex algorithm only when needed. We contribute the Multi-Resolution Task Execution (MRTE) algorithm, which utilizes human feedback to learn a mapping from a given state to an appropriate detail resolution consisting of a state and action representation, and an algorithm providing a mapping from states to actions at that resolution. The robot learns a policy from human demonstration to switch between different detail resolutions as needed while favoring lower detail resolutions to reduce the computational cost of task execution. We then present the Model Plus Correction (M+C) algorithm, which improves the performance of an algorithm using corrective human feedback without modifying the algorithm itself. Finally, we introduce the Multi-Resolution Model Plus Correction (MRM+C) algorithm as a combination of MRTE and M+C. MRM+C learns from human demonstration how to select an appropriate detail resolution to operate at in a given state. Furthermore, it allows the teacher to provide corrective demonstration at different detail resolutions to improve overall task execution performance. We provide formal definitions of the MRTE, M+C, and MRM+C algorithms, and show how they relate to the general robot control problem and the Learning from Demonstration (LfD) approach. We present experimental results demonstrating the effectiveness of the proposed methods on a goal-directed humanoid obstacle avoidance task.

Keywords Learning from Human Demonstration · Complementary Corrective Demonstration · Multi-Resolution Task Execution

1 Introduction

The computational footprint of a robot controller is often overlooked as long as the controller performs as expected on the robot. As robots become more ubiquitous and general-purpose, multiple software modules with different purposes are likely to run on the robot simultaneously. Despite continuous advancements in computational technology that enable mobile robots to be equipped with ever more powerful computers, the on-board computational resources to be shared among these multiple software components are still limited.

Ç. Meriçli
Computer Science Department
Carnegie Mellon University
E-mail:

[email protected]

Most robot control approaches consider a fixed state representation computed using the sensory input, an action representation to be executed on the robot, and an algorithm for

M. Veloso
Computer Science Department
Carnegie Mellon University
E-mail:

[email protected]

executing the task at hand that maps a given state into an action. Employing a complex algorithm that uses the most detailed state representation and action definitions available might be computationally expensive and infeasible for con-

H. L. Akın
Department of Computer Engineering
Boğaziçi University
E-mail:

[email protected]

tinuous use. Although some instances of the task might be 2 C ¸ etin Meric¸li et al. handled using simpler algorithms operating on less detailed task or skill execution [2]. The LfD methods make use of state representations and action definitions, a system that re- a teacher who demonstrates the robot how to perform the lies solely on such an algorithm might fail to capture the de- task or skill at hand while the robot records the demon- tails in more complex situations, and that might eventually strated actions along with the perceived state of the system lead to a failure in the task execution. synchronously. The robot then derives an execution policy We contribute the Multi-Resolution Task Execution al- to reproduce the demonstrated task or skill using the stored gorithm (MRTE), a general framework that employs a set of state-action pairs. There are two main methods in which the detail resolutions where each resolution has its own state and demonstrations can occur: action representations, and an algorithm using these repre- sentations to perform the task. A detail resolution selection – Learning from observation: In this category, the teacher policy is learned from human demonstration and used to de- performs the task or skill and the robot acquires the de- termine which detail resolution to operate at in a particular monstration examples passively through observing the state of the system. We then present Model Plus Correction teacher. This type of demonstration requires the robot (M+C), an algorithm based on our previous work for im- to be able to identify and map the teacher’s actions to proving the performance of an existing algorithm by aug- its own action set. This problem is also known as the menting it with corrective human feedback [13, 14]. Finally, correspondence problem. we combine MRTE with M+C into the Multi-Resolution – Learning from experience: In this category, the teacher Model Plus Correction (MRM+C) algorithm. 
Through multi- makes the robot execute the task or skill by means of ei- resolution corrective demonstration, MRM+C learns not only ther manipulating the body parts of the robot or through how to dynamically change the detail resolution at which to instructing the robot using its own action set. operate for a given state, but also how to improve the exe- cution performances of individual algorithms for each de- We define the LfD problem formally as an instance of tail resolution through multi-resolution corrective demon- general robot control problem using a tuple < Z, A, πdemo >. stration. The execution policy πdemo : Z → A is extracted from a Over the course of a training session, a teacher observes demonstration dataset D consisting of teacher demonstra- the robot executing the task using hand-coded algorithms, tions d ∈ D, where d =< z, ademo >, z ∈ Z, ademo ∈ A, and intervenes if the current algorithm needs a correction, and ademo is the action demonstrated by the teacher in the or if the detail resolution in use is too coarse to cope with observed state z. During execution, the robot uses πdemo the current situation. The robot learns a detail switching pol- for selecting the next action a based on the current observed icy for deciding which detail resolution to use in a particular state z. The execution model of generic learning from demon- state while also building up individual corrective demonstra- stration system is given in Fig. 1. tion databases for the algorithms at each detail resolution. During the autonomous execution of the task, the robot first ENVIRONMENT sensory input chooses the most convenient detail resolution to run at, and then computes the action to be performed in the perceived state at the selected detail resolution. z ademo actuation πdemo 2 Background and Related Work Teacher We define the general robot control problem formally as a tuple < Z, A, π >. The world consists of states S, and A is the set of actions the robot can take. 
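The LfD definition above leaves the derivation of π_demo from the dataset D open. As a minimal illustration, a nearest-neighbor lookup over stored (z, a_demo) pairs is one common choice; the 2-D state tuples, the action labels, and the 1-NN rule in this sketch are illustrative assumptions, not the paper's method.

```python
from math import dist

def derive_policy(demos):
    """Derive a simple 1-nearest-neighbor execution policy pi_demo: Z -> A
    from a demonstration dataset of (observed_state, demonstrated_action)
    pairs, as one possible instantiation of the LfD tuple <Z, A, pi_demo>."""
    def pi_demo(z):
        # Select the action demonstrated in the most similar observed state.
        _, action = min(demos, key=lambda d: dist(d[0], z))
        return action
    return pi_demo

# Hypothetical 2-D observed states with discrete actions:
demos = [((0.0, 0.0), "forward"), ((1.0, 0.0), "left"), ((0.0, 1.0), "right")]
policy = derive_policy(demos)
assert policy((0.1, 0.1)) == "forward"
assert policy((0.9, 0.2)) == "left"
```

Any supervised learner mapping observed states to actions could stand in for the nearest-neighbor lookup here.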
The state is not fully z demonstration ateacher observable; instead, the robot has access to an observed state data Z through the mapping M : S → Z. The robot uses an execution policy π : Z → A for selecting the next action Fig. 1 The schematic representation of the generic LfD system. a ∈ A based on the current observed state z ∈ Z. Corrective demonstration is a form of teacher demon- 2.1 Learning from Demonstration stration focusing on correcting an action of the robot by proposing one of the following types of feedback: Learning from Demonstration (LfD) is a supervised learn- ing approach for transferring task or skill knowledge to an – An alternative action to be executed in that state autonomous robot by means of the demonstrations of the – A modification to the selected action Multi-Resolution Corrective Demonstration for Efficient Task Execution and Refinement 3 The usual form of employing corrective demonstration is threshold, the robot proceeds with the execution of the se- either through adding the corrective demonstration exam- lected action. Otherwise, it asks for teacher demonstration. ple to the demonstration dataset, or replacing an example The system builds a statistical model of the received demon- in the dataset with the corrective example. The action policy stration examples, and becomes more confident in situations is then re-derived using the updated demonstration dataset. where it has received a higher number of demonstrations. However, re-deriving the execution policy each time a The CBA approach reduces the need for teacher attendance, correction is received can be cumbersome if the total num- hence it makes the teaching process less tedious and time ber demonstrations is large. On the other hand, accumulat- consuming for the teacher. 
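The two usual ways of incorporating a corrective example, appending it to the dataset or replacing an existing example, can be sketched as follows. The `replace_threshold` that decides between the two strategies and the tuple encodings are hypothetical; the paper itself does not fix these details.

```python
from math import dist

def apply_correction(demos, z, corrected_action, replace_threshold=0.2):
    """Incorporate a corrective demonstration: replace the nearest existing
    example if it lies close enough to the corrected state, otherwise append
    the correction as a new example. The threshold is an assumed parameter."""
    if demos:
        i = min(range(len(demos)), key=lambda k: dist(demos[k][0], z))
        if dist(demos[i][0], z) < replace_threshold:
            demos[i] = (z, corrected_action)   # replace a conflicting example
            return demos
    demos.append((z, corrected_action))        # add a new example
    return demos

demos = [((0.0, 0.0), "forward")]
apply_correction(demos, (0.05, 0.0), "left")   # replaces the nearby example
apply_correction(demos, (1.0, 1.0), "right")   # appends a distant example
assert demos == [((0.05, 0.0), "left"), ((1.0, 1.0), "right")]
```

After either update, the execution policy would be re-derived from the modified dataset, with the efficiency caveats discussed above.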
The main difference between the ing a number of corrective demonstration points, and then CBA approach and our proposed research is that instead of re-deriving the execution policy may be misleading or in- starting from scratch, our approach needs teacher feedback efficient since the demonstrator will not be able to see the only when the existing algorithms fail to behave properly. effect of the provided corrective feedback immediately. Simultaneous execution and feedback has been utilized for skill refinement through tactile correction. Argall et al. proposed a method for refining a demonstrated skill execu- 2.2 Related Work tion policy using kinesthetic feedback from the teacher dur- ing the execution of the skill using the execution policy ex- LfD is a rapidly growing field in robotics research as it en- tracted from the demonstration examples [3]. In the Tactile ables people who know how to perform a task but not how Policy Correction approach, if tactile feedback is detected, to program robots to transfer the task knowledge to the robot the policy is modified according to the received corrective through demonstration. Being a very active field of research, tactile feedback. This approach shares a similarity with our LfD approach has been utilized in many different learning proposed approach as both systems utilize provided human scenarios, focusing on different aspects of the learning. Here, feedback interleaved with the task or skill execution. we present a few representative studies and state how our ap- proach relates to them. Another method for learning low level skills from hu- Thomaz and Breazeal proposed a method for utilizing man demonstration through high level communication meth- human feedback as the reward signal for the Reinforcement ods is the Advice Operator Policy Improvement (A-OPI) ap- Learning (RL) system [16]. They used a simulated kitchen proach proposed by Argall et al. [1]. 
A set of defined ver- environment modeled as a Markov Decision Process (MDP) bal operators are associated with functional transformations where a robot tries to learn how to bake a cake. The human for low level robot motion. The teacher provides feedback teacher observes the robot operating, and provides a reward in the form of defined verbal operators and the correspond- signal at any time without interrupting the operation. The ing transformations are applied on the specified portion of notion of observing the robot executing the task and inter- the demonstration database. A new execution policy is then vening to provide feedback bears a resemblance with our re-derived out of the modified demonstration database. The proposed approach. However, they utilize the feedback as a A-OPI approach is evaluated on a trajectory learning task reward signal to an action selected by the robot whereas in using a Segway RMP robot platform. Kolter et al. proposed our proposed approach, the teacher provides corrective feed- a hierarchical apprenticeship learning approach for learning back on robot’s actions. The second main difference is that complex skills which are non-trivial even for the domain ex- our proposed approach updates its existing task definitions perts [9]. They propose a method that allows the teacher to to improve task execution performance while their approach provide advice at different hierarchical levels as providing utilizes the received feedback for training a RL system. They isolated advice for a smaller part of the skill is often eas- evaluated three different ways of making queries where each ier for the teacher. This approach shares similarities with of the methods differ in the conditions of when to ask teacher our multi-resolution task and skill refinement approach since for a demonstration using an upper-torso humanoid robot in one of the two key advantages of our multi-resolution ap- a concept learning task. 
They presented a user study where proach is the ability to cover a larger portion of the state- they evaluate the performance of the different ways of ask- action space with demonstration provided at a low detail ing for feedback against each other and against a baseline resolution. However, our approach differs from the hierar- supervised learning method. chical apprenticeship learning approach as it utilizes multi- Chernova and Veloso introduced an approach for learn- ple algorithms with different computational complexities to ing behavior policies from human demonstration called Con- handle different situations of the same task. [15] proposed a fidence Based Autonomy (CBA) [4]. The CBA approach hierarchical LfD approach on humanoid robots. The learned utilizes a confidence calculation mechanism for assessing behaviors are represented with a hierarchy of finite state ma- how confident the robot is about the action selected by its chines, where different sub-parts of the task hierarchy can be execution policy. If the confidence value is above a certain addressed during the demonstration. 4 C ¸ etin Meric¸li et al. Grollman and Jenkins propose a learning from demon- 3.1 Multi-Resolution Task Execution (MRTE) stration framework called “Dogged Learning” [7], and ap- plied it to learning quadruped walking and a set of skills We define Multi-Resolution Task Execution (MRTE) frame- related to playing soccer on a Sony AIBO. 
The Dogged work as a tuple: Learning algorithm share similarities with our approach on having multiple “boxes” generating output and an arbitra- < πarbitrator , {c1 , c2 , ..., cN } > tor for computing the final output but they use this approach for learning input-output associations whereas our approach where cr is the controller defined for the detail resolution utilizes human feedback to learn how to select the appropri- r ∈ R, and πarbitrator : Z → R is the detail resolution se- ate detail resolution, and to correct the underlying default lection policy. A controller for the detail resolution r is de- algorithms. fined as a tuple cr =< Zr , Ar , frstate , fraction , πmodel (r) >, where frstate : S → Sr is the function for mapping the Cobo et al. proposed a method for learning state ab- global state to the state definition at the detail resolution r, stractions from demonstration data [5]. Using two differ- and fraction : Ar → A is the function for mapping the action ent algorithms, they select a subset of the originally avail- computed at the detail resolution r into an action representa- able features that would yield least performance loss. Re- tion that can be executed by the robot. πmodel (r) is the task ducing the dimensionality of the state space improves the execution policy for each detail resolution r. learning performance of the RL system they use. Automati- The detail resolution selection component acts as an ar- cally selecting a subset of available features for constructing bitrator among the different detail resolutions, and decides a reward function for RL has also been investigated. Levine which detail resolution to operate at for computing the next et al. proposed an algorithm that selects relevant features action given the current state. A human teacher provides through building logical conjunctions of the features to the demonstrations to teach the robot when to switch into a finer example policy [11]. Meric¸li et al. 
introduced an automated detail resolution. Unless stated otherwise, the system always method for learning reward function as a linear combination runs at the most coarse detail resolution. In other words, the of available features [12]. They use Genetic Algorithms to system assumes that the current situation can be handled learn appropriate feature combination and use the learning by the simplest algorithm and state-action representations performance with the candidate reward function as the fit- unless a feedback for increasing the detail resolution is re- ness value. Our approach differs from all these approaches ceived. The schematic representation of the MRTE approach as in all these approaches the underlying assumptions are i) is given in Fig. 2. a properly learned single state representation, and ii) a single algorithm using the learned state representation would suf- fice whereas we advocate that different instances of the same task can be handled using different algorithms and different state representations. Fig. 2 The schematic representation of the MRTE approach. 3 Approach Our approach has two components: i) a multi-resolution task execution framework, and ii) a complementary corrective 3.2 Model Plus Correction (M+C) demonstration approach. In the remainder of the section, we first introduce our multi-resolution task execution frame- We present Model Plus Correction (M+C) complementary work. We then explain complementary corrective demon- corrective demonstration approach by extending the learn- stration for augmenting the algorithms employed at each de- ing from demonstration model. Our approach makes use of tail resolution with corrective human demonstration. 
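The MRTE tuple ⟨π_arbitrator, {c1, …, cN}⟩ can be written down as plain data structures: each controller carries its own state mapping, action mapping, and execution policy, and the arbitrator picks the resolution for the current observation. The concrete policies and mappings in the usage example are invented placeholders; only the structure mirrors the definitions above.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class Controller:
    """A controller c_r for one detail resolution r."""
    f_state: Callable   # maps the global state to this resolution's state
    f_action: Callable  # maps this resolution's action to a robot action
    pi_model: Callable  # task execution policy pi_model(r) at this resolution

@dataclass
class MRTE:
    """Multi-Resolution Task Execution: an arbitrator policy plus one
    controller per detail resolution, ordered from coarsest to finest."""
    pi_arbitrator: Callable        # observed state -> resolution index
    controllers: Sequence[Controller]

    def act(self, global_state, observation):
        r = self.pi_arbitrator(observation)
        c = self.controllers[r]
        return c.f_action(c.pi_model(c.f_state(global_state)))

# Illustrative two-resolution setup; an index of 0 is the coarsest resolution,
# which the arbitrator favors unless the observation demands more detail.
coarse = Controller(lambda s: s, lambda a: a, lambda z: "forward")
fine = Controller(lambda s: s * 2, lambda a: a,
                  lambda z: "left" if z > 2 else "forward")
mrte = MRTE(pi_arbitrator=lambda obs: 1 if obs > 1 else 0,
            controllers=[coarse, fine])
assert mrte.act(2, 0.5) == "forward"   # arbitrator stays coarse
assert mrte.act(2, 1.5) == "left"      # arbitrator elevates to fine
```

In the full system, π_arbitrator is not hand-written as above but learned from the teacher's resolution-switching demonstrations.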
Finally, an algorithm as the default controller, and utilizes corrective we present the extended multi-resolution task execution and human demonstration to further improve the performance of refinement approach that combines the multi-resolution al- the algorithm without changing the algorithm itself. We de- gorithm execution and complementary corrective demon- fine the M+C system as a tuple: stration approaches. < Z, A, πdemo , πmodel , freuse > Multi-Resolution Corrective Demonstration for Efficient Task Execution and Refinement 5 The LfD definition is extended with a model-based al- gorithm πmodel : Z → A, and a correction reuse func- tion freuse (z, ademo , amodel ) : Z × A × A → A, where ademo is the action computed by πdemo , and amodel is the action computed by πmodel . The correction reuse function computes the final action to be executed by the robot as a function of the current observed state, and the actions com- puted by the model-based and the demonstration policies. During training, a human teacher observes the robot per- forming the task using the model-based policy, and corrects Fig. 4 The schematic representation of the MRM+C approach. the action of the robot if the computed action is erroneous. The robot stores the received corrections as exceptions to current detail resolution as well as the representation of the the underlying algorithm. During execution, if an exception observed state at that resolution. The same user interface is for a similar situation is found in the correction database, the also used for delivering the action corrections and chang- robot substitutes the action computed by the algorithm with ing the detail resolution. The host computer communicates a demonstrated correction action. The schematic representa- with the robot over wireless Ethernet connection. The robot tion of the M+C system is given in Fig. 3. 
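One M+C execution step from the tuple ⟨Z, A, π_demo, π_model, f_reuse⟩ can be sketched as follows: the correction database acts as a set of exceptions to the model-based policy, and f_reuse substitutes the stored correction when the current state is similar enough to a corrected one. The Gaussian similarity follows the form defined later in Sect. 4.4; the threshold value and state encoding here are placeholders.

```python
from math import dist, exp

def mc_action(z, correction_db, pi_model, K=5.0, threshold=0.75):
    """One Model Plus Correction step: run the model-based policy by default,
    but reuse a stored corrective action if a sufficiently similar correction
    exists. The threshold of 0.75 is an illustrative placeholder."""
    best_sim, best_action = 0.0, None
    for z_demo, a_demo in correction_db:
        sim = exp(-K * dist(z, z_demo) ** 2)
        if sim > best_sim:
            best_sim, best_action = sim, a_demo
    if best_sim >= threshold:
        return best_action          # f_reuse prefers the demonstrated action
    return pi_model(z)              # default: the hand-coded algorithm

db = [((0.0, 0.0), "left")]
assert mc_action((0.01, 0.0), db, lambda z: "forward") == "left"
assert mc_action((2.0, 2.0), db, lambda z: "forward") == "forward"
```

Note that the underlying algorithm π_model is never modified; all improvement lives in the correction database.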
broadcasts its computed state, the current detail resolution, and other task-specific information back to the host com- puter at each step. The robot also uses a text-to-speech soft- ware system to announce the inferred state of the system and the action selected to be executed. The received state infor- mation of the robot is then visualized on the display. The state visualization part of the interface is task-specific. Along the course of a demonstration, the teacher ob- serves the robot as it executes the task, and intervenes by means of the following feedback types: – The elaborate command: This type of feedback switches to the next finer detail resolution. – The correct command: This type of feedback replaces the computed action to be executed with another action Fig. 3 The schematic representation of the M+C approach. defined in the same detail resolution. If an elaborate command is received, the system checks if there is a finer detail resolution available. If such a resolu- tion is found, the received elaborate command is stored with 3.3 Multi-Resolution Model Plus Correction (MRM+C) the current state of the system represented with the state def- inition for the finest detail resolution available. The system We combine M+C with MRTE and contribute the Multi- then switches to that detail resolution and goes back to the Resolution Model Plus Correction (MRM+C) algorithm to action computation step. If a correct command is received, learn how to reduce the computational cost of task execu- the action to be executed by the robot is replaced with the tion, and how to improve the overall task execution perfor- corrected action. The received action is also stored in a cor- mance from human feedback. We define MRM+C as a tuple rection database along with the observed system state at the < πarbitrator , {c1 , c2 , ..., cN } > current detail resolution. The algorithm for MRM+C train- where cr =< Z, A, fstate , faction , πdemo , πmodel , freuse > ing is given in Alg. 
1. is an instance of a modified version of the M+C model de- fined as a tuple at the detail resolution r ∈ R. A schematic diagram of the Multi-Resolution M+C framework is given 3.5 Demonstration Reuse: Autonomous Execution in Figure 4. Each time the robot reaches an intermediate destination point during the autonomous task execution, the MRM+C algo- 3.4 Demonstration Delivery: Training the System rithm switches to the most coarse detail resolution available and computes an action using the algorithm associated with During the demonstration sessions, the teacher uses a cus- that detail resolution. Then, the system starts searching in tom user interface running on a host computer to access the this particular order: 6 C ¸ etin Meric¸li et al. Algorithm 1 The algorithm for training the MRM+C system Algorithm 2 The algorithm for MRM+C execution 1: resolution ← COARSEST 1: resolution ← COARSEST 2: state ← computeState(resolution) 2: state ← computeState(resolution) 3: action ← computeAction(state) 3: mostSimilar ← ∅ 4: executeAction(action) 4: maxSim ← 0 5: if f eedbackReceived() then 5: for each demo ∈ correctionDBresolution do 6: f eedback ← readF eedback() 6: sim ← getSimilarity(state, demo(state)) 7: if f eedback == ELABORAT E then 7: if sim > maxSim then 8: if resolution < F IN EST then 8: maxSim ← sim 9: saveDetailDemonstration() 9: mostSimilar ← demo 10: increaseResolution() 10: end if 11: goto 2 11: end for 12: end if 12: threshold ← getCorrectionT hreshold(resolution) 13: else if f eedback == CORRECT then 13: if maxSim > threshold then 14: action ← readCorrection() 14: action ← demo(action) 15: saveCorrectionDemonstration() 15: else 16: executeAction(action) 16: mostSimilar ← ∅ 17: end if 17: maxSim ← 0 18: end if 18: f State ← computeState(F IN EST ) 19: for each demo ∈ elaborationDB do 20: sim ← getSimilarity(f State, demo(f State)) – A correction sample in the demonstration database for 21: if sim > maxSim then 22: maxSim ← sim the current detail resolution 
23: mostSimilar ← demo – An elaborate command in the elaboration demonstration 24: end if database for switching to the next detail level with finer 25: end for resolution. 26: threshold ← getElaborationT hreshold() 27: if maxSim > threshold then If a correction sample is found in the corrective demon- 28: if resolution < F IN EST then 29: increaseResolution() stration database that is received when the robot was in a 30: goto 2 state similar enough to the current state of the system, the 31: else provided correction action is selected as the next action. If 32: action ← computeAction(state) an elaborate command is received in a state similar enough 33: end if 34: else to the current state, the system changes its detail resolution 35: action ← computeAction(state) to the next finer detail resolution and recomputes an action 36: end if using the hand-coded algorithm specified for that resolution. 37: end if We use a domain and task specific similarity metric with 38: executeAction(action) empirically determined parameters. The algorithm for the autonomous MRM+C execution is given in Alg. 2. represent an example traversal of the course by the robot 4 Humanoid Obstacle Avoidance Task with the yellow circles denoting the intermediate destination points selected by the robot. We apply the proposed multi-resolution task execution and refinement approach on an obstacle avoidance task using a humanoid robot in a robot soccer environment. We define the obstacle avoidance task for a humanoid soccer robot as the problem of walking to a destination position without col- liding with various obstacles placed on the field. In our im- plementation, the robot starts in its own goal area and the goal of the task is to reach within 1 meter distance of the opponent goal. The number, shapes, and locations of the ob- stacles on the field are unknown to the robot beforehands. 
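The autonomous execution loop of Alg. 2 can be condensed into a sketch: start at the coarsest resolution, reuse a stored correction if one is similar enough, otherwise consult the elaboration database (matched against the finest-resolution state) for a stored switch to a finer resolution, and fall back to the hand-coded algorithm. The thresholds and the `(compute_state, compute_action)` pair representation are assumptions for illustration.

```python
from math import dist, exp

def mrmc_step(resolutions, correction_dbs, elaboration_db, global_state,
              K=5.0, corr_threshold=0.8, elab_threshold=0.8):
    """One autonomous MRM+C step, loosely following Alg. 2. `resolutions` is
    a coarse-to-fine list of (compute_state, compute_action) pairs; the
    threshold values are illustrative placeholders."""
    def most_similar(db, state):
        best_sim, best = 0.0, None
        for z_demo, payload in db:
            sim = exp(-K * dist(state, z_demo) ** 2)
            if sim > best_sim:
                best_sim, best = sim, payload
        return best_sim, best

    r = 0                                   # always start at the coarsest
    while True:
        compute_state, compute_action = resolutions[r]
        state = compute_state(global_state)
        sim, action = most_similar(correction_dbs[r], state)
        if sim > corr_threshold:
            return action                   # reuse a corrective demonstration
        fine_state = resolutions[-1][0](global_state)
        sim, _ = most_similar(elaboration_db, fine_state)
        if sim > elab_threshold and r < len(resolutions) - 1:
            r += 1                          # stored 'elaborate': go finer
            continue
        return compute_action(state)        # default hand-coded algorithm

resolutions = [
    (lambda g: (g[0],), lambda s: "coarse-action"),
    (lambda g: (g[0], g[1]), lambda s: "fine-action"),
]
elab_db = [((0.0, 0.0), "elaborate")]
assert mrmc_step(resolutions, [[], []], elab_db, (0.0, 0.0)) == "fine-action"
assert mrmc_step(resolutions, [[], []], elab_db, (5.0, 5.0)) == "coarse-action"
```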
We use the regular field of the RoboCup Standard Platform League (http://www.robocup.org, http://www.tzi.de/spl) as Fig. 5 An example instance of the humanoid obstacle avoidance task the experimentation area, and Aldebaran Nao humanoid robot with an example solution in a configuration where two box-shaped ob- (http://www.aldebaran-robotics.com) as the robot platform. stacles and another humanoid robot are placed on the field. Fig. 5 presents an example instance of the humanoid ob- stacle avoidance task with three obstacles. The dashed lines Multi-Resolution Corrective Demonstration for Efficient Task Execution and Refinement 7 4.1 Free Space Modeling using Vision Instead of trying to detect the obstacles on the field, we build a free space model of the surroundings to decide which di- rection is the best to move to. The soccer field is a green carpet with white field lines on it. Therefore, anything that is non-green and lying on the field can be considered as an obstacle, except for the detected field lines (Fig. 5). We uti- (a) (b) lize a simplified version of the Visual Sonar algorithm by Lenser and Veloso [10] and the algorithm by Hoffmann et al. [8]. We scan the pixels on the image along evenly spaced vertical lines called scanlines, starting from the bottom end and continue until we encounter a certain number of non- green pixels. Although the exact distance of a certain pixel from the robot is a function of the position of the camera, the distance of a pixel increases as we ascend from the bottom of the image to the top, assuming all the pixels lie on the ground plane. If we do not encounter any green pixels along (c) a scanline, we consider that scanline as fully occupied. Oth- erwise, the point where the non-green block starts is marked Fig. 6 The environment as perceived by the robot: a) the color seg- as the end of the free space towards that direction. 
To further mented image, b) the computed perceived free space segments, and c) save some computation time, we do not process every ver- the resulting free space model. tical line on the image. Instead, we process the lines along every fifth pixel and every other pixel along those lines. As slots that fall within it is less than a certain threshold. In our a result, we effectively process only 1/10th of the image implementation, we use a threshold of 120 centimeters for (Fig. 6(b)). The pixels denoting the end of the free space are considering a free-space slot as occluded. The visualization then projected onto the ground to have a rough estimate of of the state in the low detail resolution is given in Fig. 7(a). the distance of the corresponding obstacle in the direction of At this detail resolution, the destination point can be se- the scanned line. In order to cover the entire 180o space in lected from among the five free space slot directions with a front of it, the robot pans its head from side to side. As the distance of 120 centimeters. However, the hand-coded algo- head moves, the computed free space end points are com- rithm for this resolution only selects from the three walking bined. The final computed free space is then divided into 15 directions: forward, left, or right. If the middle slot (the slot slots, each covering an arc of 12o in front of the robot. In the number 2) is free, the algorithm selects the forward direc- mean time, each free space slot is tagged with a flag indi- tion, otherwise it checks the right and left slots to decide. cating whether that slot points towards the opponent goal or The algorithm also favors the left direction over the right di- not based on the location of the opponent goal in the world rection, if the leftmost free-space slot (the slot number 4) is model, or the estimated location and orientation of the robot free. The destination point selection algorithm for the first on the field (Fig. 6(c)). 
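The subsampled bottom-up scanline pass described above can be sketched as follows. The boolean green-mask input, the `max_non_green` stopping count, and the return convention (row index of the free-space end, or `None` for a fully occupied scanline) are assumptions; the actual implementation works on color-segmented camera images and excludes detected field lines.

```python
def free_space_ends(green_mask, column_step=5, row_step=2, max_non_green=3):
    """Scan evenly spaced image columns from the bottom of the image upward
    and find where the free (green) space ends along each scanned column.
    Processing every 5th column and every other row touches roughly 1/10th
    of the image, as described above."""
    rows, cols = len(green_mask), len(green_mask[0])
    ends = {}
    for c in range(0, cols, column_step):
        saw_green, streak, block_start = False, 0, None
        for r in range(rows - 1, -1, -row_step):
            if green_mask[r][c]:
                saw_green, streak, block_start = True, 0, None
            else:
                if streak == 0:
                    block_start = r      # first row of a non-green block
                streak += 1
                if streak >= max_non_green:
                    break                # a solid obstacle block was found
        if not saw_green:
            ends[c] = None               # fully occupied scanline
        elif block_start is None:
            ends[c] = 0                  # free all the way to the image top
        else:
            ends[c] = block_start        # free space ends at the obstacle
    return ends
```

The returned end points would then be projected onto the ground plane to estimate obstacle distances per direction.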
Here, the dark triangles indicate the detail level is given in Alg. 3. free space slots pointing towards the opponent goal. For the humanoid obstacle avoidance task, we define three detail resolutions: R = {coarse, medium, f ine}. In Algorithm 3 Destination point selection algorithm for the the remainder of this section, we explain the state and action coarse detail resolution. definitions, and the destination point selection algorithms 1: slot ← −1 2: state ← getBooleanState(COARSE) for each detail resolution. 3: if ¬state(2) then 4: slot ← 2 5: else 4.2 Coarse Detail Resolution 6: if ¬state(0) then 7: slot ← 0 At the coarse detail resolution, the 180o space in front of 8: else 9: slot ← 4 the robot is divided into five equal slices of 36o each. The 10: end if existence of an obstacle along a free space slot is represented 11: end if with a boolean value in the state vector where true indicates 12: angle ← calculateDirection(slot) the slot is occupied. The slot is marked as occupied if the 13: distance ← 120 14: return calculateGlobalP oint(angle, distance) mean distance of the most detailed free space representation 8 C ¸ etin Meric¸li et al. (a) (b) (c) Fig. 7 Example visualizations for the state representations at different detail resolutions for the same situation. a) coarse detail resolution, b) medium detail resolution, and c) fine detail resolution. For the coarse and medium level resolutions, a green slot means no obstacle towards that direction, and a red slot means this direction is occluded by an obstacle. The target sign represents the selected destination point on the field according to the algorithm for that detail resolution. 4.3 Medium Detail Resolution sented with a distance value in centimeters. This value de- notes the distance to the closest detected obstacle lying within The state representation for the medium detail resolution the coverage of that particular free-space slot. 
Fig.7(c) shows uses the same principles as the coarse state representation an example visualization of the state representation for the with the exception of using nine slots instead of five. An ex- fine detail resolution. ample visualization of a medium detail resolution state rep- In addition to the destination direction, the algorithm for resentation is given in Fig. 7(b). the fine detail resolution also determines the walk distance The hand-coded algorithm for this resolution goes over towards that direction. We go over each free-space slot and each free-space slot and selects the direction of the closest for each slot we compute a weighted distance value using a available slot to the opponent goal as the destination direc- sliding window of size three with the weights 0.25 at both tion, using a fixed walking distance of 120 centimeters. The ends and 0.5 for the center. The direction of the free-space destination point selection algorithm for the medium detail slot with highest weighted distance is then selected as the resolution is given in Alg. 4. walking direction and the computed weighted distance is used as the walking distance. Algorithm 4 Destination point selection algorithm for the To be able to calculate the similarity of two given states medium detail resolution. in any of the detail resolutions, in this implementation we 1: state ← getBooleanState(M EDIU M ) use the following function: 2: goal ← getGoalSlot() 2 3: closestSlot ← 0 similarity = e−Kdiff 4: minDist ← 9 5: for slot ← 0; slot < 9; i ← slot + 1 do 6: if |goal − slot| ≤ minDist and¬state(slot) then where K is a coefficient for shaping the similarity function, 7: minDist ← |goal − slot| and diff is the calculated sum of absolute differences of the 8: closestSlot ← slot slot distances. For the boolean slots, dif f is 0, if both slot 9: end if 10: end for values are the same, and 1 otherwise. In our implementation, 11: angle ← calculateDirection(closestSlot) we use K = 5. 
12: distance ← 120 13: return calculateGlobalP oint(angle, distance) 4.5 Demonstration Interface For the humanoid obstacle avoidance task, we use a task- 4.4 Fine Detail Resolution specific user interface for both monitoring the task execu- tion, and delivering feedback. The user interface visualizes At the finest detail resolution, the free space is divided into the perceived free-space information, the position of the robot 15 slots, and the occupancy status for each slot is repre- on the field, and the current selected destination point that Multi-Resolution Corrective Demonstration for Efficient Task Execution and Refinement 9 Algorithm 5 Destination point selection algorithm for the form of a global point on the field. Similarly, at the Medium fine detail resolution. and Coarse detail resolutions, it is not possible to specify a 1: goalAngle ← getGoalAngle() destination point by clicking on the visualized field. 2: if goalAngle < − π 2 or goalAngle > π2 then 3: if |angle0 − goalAngle| < |angleN −1 − goalAngle| then 4: destAngle ← angle0 5: else 5 Results 6: destAngle ← angleN −1 7: end if Our experimental evaluation of the proposed approach is 8: destDistance ← 120 two fold. In the first part, we present an experimental evalua- 9: else 10: maxDist ← 0 tion performed on a real RoboCup field with Nao humanoid 11: for i ← 1; i < N − 1; i ← i + 1 do robot. In the second part, we present an extensive experi- 12: distance ← 0.25disti−1 + 0.5disti + 0.25disti+1 mental analysis of the proposed approach on a simulated 13: if distance > maxDist then version of the humanoid obstacle avoidance task. 
14: maxDist ← distance 15: maxSlot ← i 16: end if 17: end for 5.1 Real World Experiments 18: angle ← anglemaxSlot 19: distance ← maxDist We evaluated the performance of MRM+C approach against 20: end if 21: return calculateGlobalP oint(angle, distance) the hand-coded controllers at the lowest and the highest de- tail resolutions (Coarse Model and Fine Model algorithms) on the obstacle avoidance task using two different obstacle configurations, and an empty field as the base case (Fig. 9). We used the task completion time as the performance measure for the cases the robot was able to complete the task. The results are given in Table 1. We ran 5 trials per method for each configuration. The Rate column presents the success rate. The Time shows the average time it took the robot to complete the task for the successful trials. The units for the rate and the average time columns are percent- ages and seconds, respectively. An examination of the results yields that the success rate drops and the average task completion time gets longer as the number of obstacles increase, as expected. For the empty field configuration, all algorithms performed well in terms of success rate, while the hand coded algorithm for the fine detail resolution outperformed the others. The main reason Fig. 8 The user interface for delivering corrective demonstration to the robot. behind this result is since the fine resolution algorithm uses free space slot distances to compute the destination point, it selects a destination point very close to the opponent goal the robot walks to. A snapshot from the developed software and the task ends once the robot reach the destination so the is given in Fig. 8. robot does not lose any time in localizing itself and scan- The teacher uses the elaborate button to issue a detail ning the field for free space modeling. The performance of resolution refinement command. 
The current detail resolu- MRM+C was better than the coarse detail resolution algo- tion is also displayed on the screen. If the current detail reso- rithm but was worse than fine detail resolution algorithm lution is either Coarse or Medium, the teacher uses the radio mainly due to the number of field scans it has executed. buttons located on the bottom-right part of the interface. At For the single obstacle case, the performance of the coarse the Fine detail resolution, the user specifies the destination detail resolution algorithm degraded considerably but the point by clicking on the field visualization on the interface. fine detail resolution algorithm and MRM+C were able to There are 9 radio buttons placed on an arc, each represent- achieve high success rates. The fine detail resolution algo- ing a free space slot. For the Medium detail resolution, all rithm outperformed the MRM+C since it uses the most de- radio buttons are enabled. For the Coarse detail resolution, tailed state representation and computes long distance desti- every other button is enabled, reducing the number of en- nation points, yielding a smaller number of field scans. abled buttons to 5. At the Fine detail resolution, all radio For the three obstacles case, the coarse detail resolu- buttons are disabled as the system expects a correction in the tion algorithm was too simple to handle the case, and the 10 C ¸ etin Meric¸li et al. (a) (b) (c) Fig. 9 The obstacle configurations used in the experimental evaluation. a) empty field, b) a single obstacle placed on the center of the field, and c) three obstacles placed around the center circle. fine detail resolution algorithm was not able to compute the and the speed of the wheeled robot is limited to 10 cm/s, propoer actions in most of the times. Combining the use of which is roughly the speed of a real Nao robot. 
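In both sets of experiments, MRM+C relies on matching the current state against recorded demonstrations using the state similarity function of Section 4.4. A minimal Python sketch of that function (the function and argument names are ours, not the paper's):

```python
import math

K = 5.0  # shaping coefficient; the paper uses K = 5

def similarity(state_a, state_b, boolean_slots=False):
    """similarity = exp(-K * diff^2), where diff is the sum of absolute
    slot-distance differences, or the count of mismatches for boolean slots."""
    if boolean_slots:
        diff = sum(0 if a == b else 1 for a, b in zip(state_a, state_b))
    else:
        diff = sum(abs(a - b) for a, b in zip(state_a, state_b))
    return math.exp(-K * diff ** 2)
```

Identical states yield a similarity of 1.0, and the value decays rapidly as the summed slot difference grows.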
We imi- simpler algorithms when the current obstacle model does tate the self-localization information in the simulation with not yield the need for very detailed actions, and the cor- a global positioning system distorted with a certain amount rective demonstration actions provided by the teacher, the of white noise. MRM+C algorithm outperformed both hand-coded algorithms We used randomly distorted variations of two different despite a considerable performance degradation compared obstacle configurations with three obstacles each. For each to the previous configurations. trial, small random translational and rotational offsets are In 8 out of 15 failed trials, the failure was mostly due to applied on the obstacle positions in the configuration. the poor self localization data. The destination points com- puted by the algorithms are in global world coordinates; therefore, the performance gets heavily affected by the er- 80 Model ror in the estimated position. Model + Correction 75 Success Rate (%) Table 1 Performance evaluation results for the MRM+C approach in Humanoid Obstacle Avoidance domain. The times are in seconds. Empty Field 1 Obstacle 3 Obstacles 70 Method Rate Time Rate Time Rate Time Coarse Res. 80% 115 60% 195 0% N/A Fine Res. 100% 59 80% 94 40% 133 MRM+C 80% 96 100% 103 60% 182 65 60 Coarse Medium Fine Multi Res. Resolution 5.2 Simulation Experiments Fig. 10 The overall performance results for the individual algorithms To evaluate the proposed approach extensively without suf- and M+C instances for each detail resolution, along with the multi res- olution performances without (MRTE) and with (MRM+C) corrective fering from the indirect factors like the occasional self lo- demonstration. The performance is measured with the per cent of suc- calization errors in the real world experiments, we modeled ceeded runs. a simulated version of the humanoid obstacle avoidance do- main as the experimental testbed. 
We use the Player/Stage framework [6] to model the environment in 2D. We model the Nao humanoid robot with an omnidirectional wheeled robot base, and we use a laser range finder to emulate the vision-based free space perception used on the real Nao robots. The laser range finder readings are processed and converted into the same format as the free-space detection module on the real Nao provides. The omnidirectional walk of the Nao is modeled as holonomic motion on the 2D ground plane.

We evaluated the hand-coded algorithms for each detail resolution (M), the improved algorithms using corrective demonstration (M+C), multi-resolution task execution using the hand-coded algorithms (MRTE), and the multi-resolution corrective demonstration (MRM+C) algorithm. During the training session, 23 low level, 8 medium level, and 14 high level demonstrations were collected for the corrective demonstration part, and 22 demonstrations for changing the detail resolution were recorded for the detail resolution selection part.

We ran 100 experiments for each algorithm. Fig. 10 shows the success rates of the algorithms. The blue bar in the Multi Resolution group is the success rate for the MRTE algorithm, and the red bar in the same group is the success rate for the MRM+C algorithm. As expected, the success rate of the algorithms increases as the algorithm gets more complex and runs at a higher detail resolution. In all four configurations (three detail resolutions, and the multi-resolution execution), the M+C instances outperformed the algorithms alone, and the MRM+C algorithm outperformed the MRTE algorithm. The composition of executed actions per evaluated algorithm is given in Fig. 11. In both the MRTE and MRM+C evaluations, the majority of the executed actions were computed by the low detail resolution algorithm and the low detail resolution demonstration database. Yet, the success rates for the MRTE and MRM+C algorithms are better than those of the low and medium level algorithms, and close to that of the high level algorithm.

Fig. 11 The average number of actions executed per individual algorithm, M+C instance, and the MRTE and MRM+C systems.

To assess the effectiveness in terms of success rate versus computational cost, we approximated the computational complexity of an algorithm as the total number of comparison and assignment operations in the pseudo-code algorithms for the worst case execution scenario. The computational cost versus success rate results are given in Fig. 12. In the figure, the horizontal axis is the average number of operations per algorithm, computed as the sum of the estimated costs per detail resolution multiplied by the average number of actions at that resolution. For the execution of correction actions, we assume a flat cost regardless of the detail resolution. As seen in the figure, the success rate increases as the complexity of the underlying algorithm increases. It is also evident that the augmentation of complementary corrective demonstration actions through the M+C approach improves the success rate at each detail resolution. Similarly, multi-resolution task execution with corrective demonstration (MRM+C) performs better than MRTE. The success rate of the finest resolution algorithm with corrective demonstration (Fine M+C) is higher than that of the MRM+C system, but the success rate / computational cost ratio of MRM+C is better.

Fig. 12 The cost vs. performance plot for individual algorithms, M+C instances, MRTE, and MRM+C systems.

6 Discussion and Conclusion

In this paper, we introduced the Multi-Resolution Task Execution (MRTE) approach, a novel method for handling different situations in a task by running different algorithms with varying complexities and using state and action representations at different detail resolutions. The system favors the coarsest detail resolution by default, and learns from corrective human feedback how to dynamically change the detail resolution to operate at. To the best of our knowledge, this is the first study that uses human feedback to learn a policy for changing the system's internal representation according to its observed state.

We further elaborated on the proposed multi-resolution task execution to allow the human teacher to provide corrective feedback at different detail resolutions, improving the execution performance of the individual algorithms associated with each detail resolution. We introduced the Multi-Resolution Model Plus Correction (MRM+C) algorithm as a combination of the Multi-Resolution Task Execution (MRTE) and the Model Plus Correction (M+C) algorithms. M+C enables a human teacher to improve an algorithm through corrective feedback without changing the algorithm itself. We presented formal models for the proposed approaches that relate them with each other, and with traditional LfD methods.

Our experimental results demonstrate the computational efficiency of the multi-resolution approach compared to controllers with fixed state and action definitions.
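The control flow behind this efficiency can be sketched as follows. This is a hypothetical Python rendering of the multi-resolution execution step (the names are ours); in the actual system the resolution selector is learned from human demonstration rather than hand-coded as it is here:

```python
COARSE, MEDIUM, FINE = 0, 1, 2

def select_resolution(state):
    """Toy stand-in for the demonstration-learned selector:
    stay coarse unless the nearest obstacle gets close."""
    nearest = min(state)  # state: free-space slot distances in cm
    if nearest > 120:
        return COARSE
    return MEDIUM if nearest >= 60 else FINE

def mrte_step(state, policies):
    """Run only the controller of the selected detail resolution,
    so the per-step cost tracks the cheapest adequate algorithm."""
    r = select_resolution(state)
    return r, policies[r](state)
```

With `policies = [coarse_algo, medium_algo, fine_algo]`, exactly one controller executes per step, which is the source of the reduced average computational footprint.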
In the real world experiments, the MRM+C system outperformed the algorithm at the finest detail level, demonstrating the task execution improvement obtained with corrective demonstration. The experimental results from the simulated version of the task also confirm the improvement in task execution performance in the presence of corrective demonstration. Furthermore, the results demonstrate that the computational footprint of the overall task execution can be reduced drastically without critically sacrificing task execution performance.

Since our main motivation was to evaluate the effectiveness of the underlying formal model, in this paper we assumed that the teachers have a technical understanding of the task at hand, and are able to use the custom user interface both for monitoring the current state of the robot and for delivering feedback. We assume that the teacher has no difficulties in interpreting the observed state of the system at different resolutions. Furthermore, we assume that the teacher can accurately foresee whether a higher detail resolution would render the task doable in a given situation if the current computed action needs correction. As a result, we assumed that the demonstration data collected through the training session is of good quality, and contains little or no wrong feedback. In our experimental evaluation, the teacher was the first author, fulfilling the assumptions listed above. The teacher used the user interface to infer the state of the system, and chose the best action according to his observations without having access to an optimal policy.

In our current evaluation, the finest state representation is the set of all available state features, and we compute the coarser detail resolutions from the finest state representation. Therefore, in the current implementation, using the finest detail resolution for the elaboration commands does not create a computational burden. However, in other tasks, the finest detail resolution might require intensive computations. To address this issue, we will investigate methods for using the coarser detail resolutions to handle elaboration commands in the future.

Our approach provides a very convenient framework for the mass deployment of robots into the real world, as it allows the robots to be shipped with a basic set of abilities, and allows the users to tailor the robot behavior according to the needs of their particular environmental or task conditions. Therefore, our future plans include an experimental evaluation of our approach with people having varying levels of technical understanding. Although the formal framework is suitable for any task or skill, we believe there is much room for improvement in the social interaction between the teacher and the robot. In particular, we expect non-technical everyday users to have difficulties in interpreting the state of the world as observed by the robot. As a consequence, we expect untrained users to have difficulties in developing an intuition on when to provide correction and when to switch to a finer detail resolution solely by interpreting the current observed state displayed on the user interface. We hypothesize that a possible way of improving the teacher's understanding of the world as perceived by the robot is to make the robot more interactive and to give the teacher the ability to query the robot's state through dialog.

We plan to evaluate the efficiency of our approach on a variety of more complex real world tasks to understand and tackle the challenges in teacher-robot interaction. Furthermore, we plan to evaluate the robustness of our approach against uncertainty in sensing and action, and against noisy demonstration data.

References

1. Argall, B., Browning, B., Veloso, M.: Learning robot motion control with demonstration and advice-operators. In: Proceedings of IROS'08 (2008)
2. Argall, B.D., Chernova, S., Veloso, M., Browning, B.: A survey of robot learning from demonstration. Robotics and Autonomous Systems 57(5), 469–483 (2009). DOI http://dx.doi.org/10.1016/j.robot.2008.10.024
3. Argall, B.D., Sauser, E., Billard, A.: Tactile guidance for policy adaptation. Foundations and Trends in Robotics 1(2), 79–133 (2010)
4. Chernova, S., Veloso, M.: Interactive policy learning through confidence-based autonomy. Journal of Artificial Intelligence Research 34 (2009)
5. Cobo, L.C., Zang, P., Isbell, C.L., Jr., Thomaz, A.L.: Automatic state abstraction from demonstration. In: Proceedings of IJCAI 2011
6. Gerkey, B.P., Vaughan, R.T., Howard, A.: The Player/Stage project: Tools for multi-robot and distributed sensor systems. In: Proceedings of ICAR 2003
7. Grollman, D., Jenkins, O.: Dogged learning for robots. In: Proceedings of ICRA 2007
8. Hoffmann, J., Jüngel, M., Lötzsch, M.: A vision based system for goal-directed obstacle avoidance used in the RC'03 obstacle avoidance challenge. In: Proceedings of the RoboCup 2004 Symposium
9. Kolter, J.Z., Abbeel, P., Ng, A.Y.: Hierarchical apprenticeship learning with application to quadruped locomotion. In: Proceedings of NIPS'07 (2007)
10. Lenser, S., Veloso, M.: Visual sonar: Fast obstacle avoidance using monocular vision. In: Proceedings of IROS 2003
11. Levine, S., Popovic, Z., Koltun, V.: Feature construction for inverse reinforcement learning. In: Proceedings of NIPS 2010, pp. 1342–1350
12. Mericli, C., Mericli, T., Akin, H.L.: A reward function generation method using genetic algorithms: A robot soccer case study. In: Proceedings of AAMAS 2010 (2010)
13. Mericli, C., Veloso, M., Akin, H.L.: Task refinement for autonomous robots using complementary corrective human feedback. International Journal of Advanced Robotic Systems (2011)
14. Mericli, C., Veloso, M., Akin, H.L.: Improving biped walk stability with complementary corrective demonstration. Autonomous Robots 32, 419–432 (2012)
15. Sullivan, K., Luke, S., Ziparo, V.A.: Hierarchical learning from demonstration on humanoid robots. In: Humanoids 2010 Workshop on Humanoid Robots Learning from Human Interaction (2010)
16. Thomaz, A.L., Breazeal, C.: Reinforcement learning with human teachers: Evidence of feedback and guidance with implications for learning performance. In: Proceedings of AAAI 2006

Çetin Meriçli is a post-doctoral fellow in the Computer Science Department at Carnegie Mellon University. He received his Ph.D. from the Department of Computer Engineering at Boğaziçi University, Turkey in 2011. His research interests include robot learning from demonstration, human-robot interaction, developmental robotics, multi-robot systems, robot soccer, and robot vision.

Manuela Veloso is the Herbert A. Simon Professor of Computer Science in the Computer Science Department at Carnegie Mellon University. She researches in artificial intelligence and robotics towards a vision of robots coexisting with humans in a seamless integration of intelligence. She directs the CORAL research laboratory, for the study of agents that Collaborate, Observe, Reason, Act, and Learn. Professor Veloso is the president of the AAAI (Association for the Advancement of Artificial Intelligence), and a trustee of the RoboCup Federation. She received the 2009 ACM/SIGART Autonomous Agents Research Award. Professor Veloso is the author of one book, "Planning by Analogical Reasoning".

H. Levent Akın received his Ph.D. degree in Nuclear Engineering from Boğaziçi University in 1984, and he has been a professor in the Department of Computer Engineering, Boğaziçi University since 1989. He is the director of the Artificial Intelligence Lab and the founder of the Robotics Group. He is currently Dean of the Faculty of Engineering of Boğaziçi University. He is a trustee of the RoboCup Federation. He is the chair of the IEEE Computational Intelligence Society Turkey Chapter. His research interests include artificial intelligence, autonomous robots, and computational intelligence, and he has published more than 100 papers on these topics.