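Human-in-the-Loop scenarios
Resuming an interrupted task
    # Step 1: start the task; the agent runs into an INFO action
    result = ask_agent_start_new_task(
        device_id=device_id,
        task="去淘宝帮我选一个生日礼物",  # "Pick a birthday gift for me on Taobao"
        # ...
    )
    # Returns: stop_reason="INFO_ACTION_NEEDS_REPLY", session_id="xxx"

    # Step 2: continue once the user has supplied a reply
    result = ask_agent_continue(
        device_id=device_id,
        task=None,                   # no need to restate the task
        session_id="xxx",            # reuse the previous session ID
        reply_from_client="铜苹果",  # the user's reply ("a copper apple")
        # ...
    )
Switching between tasks
    # Start task A
    result_a = ask_agent_start_new_task(
        device_id=device_id,
        task="打开微信并发送消息",  # "Open WeChat and send a message"
        # ...
    )

    # After it finishes, start the unrelated task B
    result_b = ask_agent_start_new_task(  # start_new_task resets the environment
        device_id=device_id,
        task="打开高德地图导航到公司",  # "Open Amap and navigate to the office"
        # ...
    )
2.3.4 Summary of core differences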
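Environment state:
- ask_agent_start_new_task: resets the device environment to its initial state
- ask_agent_continue: keeps the device's current environment state
Session management:
- ask_agent_start_new_task: creates a new session
- ask_agent_continue: continues the existing session
When to use:
- ask_agent_start_new_task: new tasks, independent tasks, or tasks that need a clean environment
- ask_agent_continue: resuming a task, preserving context, Human-in-the-Loop
Context continuity:
- ask_agent_start_new_task: no context is carried over
- ask_agent_continue: preserves the task context and the app state
This design lets the system handle both independent, discrete tasks and complex tasks that need continuity: an unrelated task starts from a clean state, while a follow-up call reuses the existing context instead of starting over.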
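The split above can be read as a small decision rule. The sketch below is one hypothetical way a client might pick the next tool. `choose_tool` and its `same_context` flag are illustrative names that do not come from the project, and the rule is only one reading of the usage guidance in the tool docstrings: continue after an INFO stop or after a successful, related task, and start fresh otherwise.

```python
def choose_tool(last_result, same_context):
    """Return the name of the tool a client could call next.

    last_result:  dict returned by the previous agent call, or None if
                  there was no previous call.
    same_context: True when the next request relates to the previous task
                  (hypothetical flag, decided by the caller).
    """
    if last_result is None:
        return "ask_agent_start_new_task"

    stop_reason = last_result.get("stop_reason")

    # The agent paused to ask a question: the reply belongs to that session.
    if stop_reason == "INFO_ACTION_NEEDS_REPLY":
        return "ask_agent_continue"

    # A successful, related follow-up can reuse the current app state.
    if stop_reason == "TASK_COMPLETED_SUCCESSFULLY" and same_context:
        return "ask_agent_continue"

    # Failed calls and unrelated tasks get a clean environment.
    return "ask_agent_start_new_task"
```

Keeping the rule in one function makes the policy easy to adjust, for example if a client prefers to always reset after any failure.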
2.4 Code
The code for ask_agent_start_new_task is as follows:
    @mcp.tool
    def ask_agent_start_new_task(
        device_id: Annotated[str, Field(description="ID of the device to perform the task on. listed by list_connected_devices tool.")],
        task: Annotated[str | None, Field(description="The task that the agent needs to perform on the mobile device. if this is not None, the agent will try to perform this task. if None, the session_id must be provided to continue the previous session.")],
        # reset_environment: Annotated[bool, Field(description="Whether to reset the environment before executing the task, close current app, and back to home screen. If you want to execute a independent task, set this to True will make it easy to execute. If you want to continue the previous session, set this to False.")] = False,
        max_steps: Annotated[int, Field(description="Maximum number of steps the agent can take to complete the task.")] = 20,
        # session_id: Annotated[str | None, Field(description="Optional, session ID must provide when the last task endwith INFO action and you want to reply, the session id and device id and the reply from client must be provided.")] = None,

        # When the INFO action is called, how to handle it.
        # 1. "auto_reply": the INFO action will be handled automatically by calling the caption model to generate image captions.
        # 2. "no_reply": the INFO action will be ignored. THE AGENT MAY GET STUCK IF THE INFO ACTION IS IGNORED.
        # 3. "manual_reply": the INFO action will cause an interruption, and the user needs to provide the reply manually by input things in server's console.
        # 4. "pass_to_client": the INFO action will be returned to the MCP client to handle it.
        # reply_mode: Annotated[str, Field(description='''
        #     How to handle the INFO action during task execution.
        #     Options:
        #     - "auto_reply": Automatically generate image captions for INFO actions.
        #     - "no_reply": Ignore INFO actions (may cause the agent to get stuck).
        #     - "manual_reply": Interrupt and require user input for INFO actions.
        #     - "pass_to_client": Pass INFO actions to the MCP client for handling.
        # ''')] = "auto_reply",

        # reply_from_client: Annotated[str | None, Field(description="If the last task is ended with INFO action, and you want to give GUI agent a reply, provide the reply here. If you do so, you must provide last session id and last device id.")] = None,
    ) -> dict:
        """
        # Ask GUI Agent to start performing a new task on a connected device.

        Ask the GUI agent to perform the specified task on a connected device.
        The GUI Agent can be able to understand natural language instructions and interact with the device accordingly.
        The agent will be able to execute a high-level task description,if you have any additional requirements, write them down in detail at tast string.
        This function will reset the environment before executing the task, close current app, and back to home screen.
        if you have

        ## The agent has the below limited capabilities:
        1. The task must be related to an app that is already installed on the device. for example, "打开微信,帮我发一条消息给张三,说今天下午三点开会"; "帮我在淘宝上搜索一款性价比高的手机,并加入购物车"; "to purchase an ea on Amazon".
        2. The task must be simple and specific. for example, "do yyy in xxx app"; "find xxx information in xxx app". ONE THING AT ONE APP AT A TIME.
        3. The agent may not be able to handle complex tasks that require multi-step reasoning or planning. for example. You may need to break down complex tasks into simpler sub-tasks and ask the agent to perform them sequentially. For example, instead of asking the agent to "plan a trip to Paris for xxx", you can ask it to "search for flights to Paris on xxx app", "find hotels in Paris on xxx app", make the plan yourself and ask agent to "sent the plan to xxx via IM app like wechat".
        4. The agent connot accept multimodal inputs now. if you want to provide additional information like screenshot captions, please include them in the task description.

        ## Usage guidance:
        1. you should never directly ask an Agent to pay or order anything. If user want to make a purchase, you should ask agent to stop brfore ordering/paying, and let user to order/pay.
        2. tell the agent, if human verification is appeared during the task execution, the agent should ask Client. when the you see the INFO, you should ask user to handle the verification manually. after user says "done", you can continue the task with the session_id and device_id and ask the agent to continue in reply_from_client.
        3. IF the last agentic call is failed or you want to perform a new task in different app, you should always use this function to start a new task, so that the environment will be reset before executing the task.

        Returns:
            dict: Execution log containing details of the task execution. with keys including
                - device_info: Information about the device used for task execution.
                - final_action: The final action taken by the agent to complete the task.
                - global_step_idx: The total number of steps taken during the task execution.
                - local_step_idx: The number of steps taken in the current session.
                - session_id: The session ID for maintaining context across multiple tasks.
                - stop_reason: The reason for stopping the task execution (e.g., TASK_COMPLETED_SUCCESSFULLY).
                - task: The original task description provided to the agent.
        """
        reply_mode = "pass_to_client"

        # if task is not None:
        #     assert session_id is None, "If task is provided, session_id must be None."
        #     # New task, so reset_environment is True
        #     reset_environment = True
        # else:
        #     assert session_id is not None, "If task is None, session_id must be provided to continue the previous session."
        #     # Continuing previous session, so reset_environment is False
        #     reset_environment = False
        reset_environment = True

        return_log = execute_task(
            device_id=device_id,
            task=task,
            reset_environment=reset_environment,
            max_steps=max_steps,
            # enable_intermediate_logs=False,
            # enable_intermediate_image_caption=False,
            # enable_intermediate_logs=True,
            # enable_intermediate_image_caption=False,
            enable_intermediate_image_caption=True,
            enable_intermediate_screenshots=False,
            enable_final_screenshot=False,
            # enable_final_image_caption=False,
            enable_final_image_caption=True,
            reply_mode=reply_mode,
            session_id=None,
            # session_id=session_id,
            reply_from_client=None,
            # reply_from_client=reply_from_client,
        )
        return return_log
The code for ask_agent_continue is as follows:
    @mcp.tool
    def ask_agent_continue(
        device_id: Annotated[str, Field(description="ID of the device to perform the task on. listed by list_connected_devices tool.")],
        task: Annotated[str | None, Field(description="The task that the agent needs to perform on the mobile device. if this is not None, the agent will try to perform this task. if None, the session_id must be provided to continue the previous session.")],
        # reset_environment: Annotated[bool, Field(description="Whether to reset the environment before executing the task, close current app, and back to home screen. If you want to execute a independent task, set this to True will make it easy to execute. If you want to continue the previous session, set this to False.")] = False,
        max_steps: Annotated[int, Field(description="Maximum number of steps the agent can take to complete the task.")] = 20,
        # session_id: Annotated[str | None, Field(description="Optional, session ID must provide when the last task endwith INFO action and you want to reply, the session id and device id and the reply from client must be provided.")] = None,

        # When the INFO action is called, how to handle it.
        # 1. "auto_reply": the INFO action will be handled automatically by calling the caption model to generate image captions.
        # 2. "no_reply": the INFO action will be ignored. THE AGENT MAY GET STUCK IF THE INFO ACTION IS IGNORED.
        # 3. "manual_reply": the INFO action will cause an interruption, and the user needs to provide the reply manually by input things in server's console.
        # 4. "pass_to_client": the INFO action will be returned to the MCP client to handle it.
        # reply_mode: Annotated[str, Field(description='''
        #     How to handle the INFO action during task execution.
        #     Options:
        #     - "auto_reply": Automatically generate image captions for INFO actions.
        #     - "no_reply": Ignore INFO actions (may cause the agent to get stuck).
        #     - "manual_reply": Interrupt and require user input for INFO actions.
        #     - "pass_to_client": Pass INFO actions to the MCP client for handling.
        # ''')] = "auto_reply",

        # reply_from_client: Annotated[str | None, Field(description="If the last task is ended with INFO action, and you want to give GUI agent a reply, provide the reply here. If you do so, you must provide last session id and last device id.")] = None,
    ) -> dict:
        """
        # Ask GUI Agent to continue performing a task on a connected device, using previous context.

        Ask the GUI agent to perform the specified task on a connected device.
        The GUI Agent can be able to understand natural language instructions and interact with the device accordingly.
        The agent will be able to execute a high-level task description,if you have any additional requirements, write them down in detail at tast string.
        This function will **NOT** reset the environment before executing the task, so that the agent can continue the previous session.
        if you have

        ## The agent has the below limited capabilities:
        1. The task must be related to an app that is already installed on the device. for example, "打开微信,帮我发一条消息给张三,说今天下午三点开会"; "帮我在淘宝上搜索一款性价比高的手机,并加入购物车"; "to purchase an ea on Amazon".
        2. The task must be simple and specific. for example, "do yyy in xxx app"; "find xxx information in xxx app". ONE THING AT ONE APP AT A TIME.
        3. The agent may not be able to handle complex tasks that require multi-step reasoning or planning. for example. You may need to break down complex tasks into simpler sub-tasks and ask the agent to perform them sequentially. For example, instead of asking the agent to "plan a trip to Paris for xxx", you can ask it to "search for flights to Paris on xxx app", "find hotels in Paris on xxx app", make the plan yourself and ask agent to "sent the plan to xxx via IM app like wechat".
        4. The agent connot accept multimodal inputs now. if you want to provide additional information like screenshot captions, please include them in the task description.

        ## Usage guidance:
        1. you should never directly ask an Agent to pay or order anything. If user want to make a purchase, you should ask agent to stop brfore ordering/paying, and let user to order/pay.
        2. tell the agent, if human verification is appeared during the task execution, the agent should ask Client. when the you see the INFO, you should ask user to handle the verification manually. after user says "done", you can continue the task with the session_id and device_id and ask the agent to continue in reply_from_client.
        3. IF the last agentic call is successful or the last action is INFO or the new task is related to the previous task, you can use this function to continue the task, so that the agent can finish the task faster by leveraging the previous context.

            dict: Execution log containing details of the task execution. with keys including
                - device_info: Information about the device used for task execution.
                - final_action: The final action taken by the agent to complete the task.
                - global_step_idx: The total number of steps taken during the task execution.
                - local_step_idx: The number of steps taken in the current session.
                - session_id: The session ID for maintaining context across multiple tasks.
                - stop_reason: The reason for stopping the task execution (e.g., TASK_COMPLETED_SUCCESSFULLY).
                - task: The original task description provided to the agent.
        """
        reply_mode = "pass_to_client"

        # if task is not None:
        #     assert session_id is None, "If task is provided, session_id must be None."
        #     # New task, so reset_environment is True
        #     reset_environment = True
        # else:
        #     assert session_id is not None, "If task is None, session_id must be provided to continue the previous session."
        #     # Continuing previous session, so reset_environment is False
        #     reset_environment = False
        reset_environment = False

        return_log = execute_task(
            device_id=device_id,
            task=task,
            reset_environment=reset_environment,
            max_steps=max_steps,
            # enable_intermediate_logs=False,
            # enable_intermediate_image_caption=False,
            # enable_intermediate_logs=True,
            enable_intermediate_image_caption=True,
            enable_intermediate_screenshots=False,
            enable_final_screenshot=False,
            # enable_final_image_caption=False,
            enable_final_image_caption=True,
            reply_mode=reply_mode,
            session_id=None,
            # session_id=session_id,
            reply_from_client=None,
            # reply_from_client=reply_from_client,
        )
        return return_log
0x03 INFO action
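3.1 Core characteristics of the INFO action
The INFO interaction mode is special in two ways:
User input request: INFO is the only interaction mode that requires active input from the user. Unlike automatically executed actions such as CLICK, TYPE, and AWAKE, INFO interrupts the automated flow to obtain user feedback.
Task pause mechanism: when an INFO action is executed, the automated flow pauses and the system waits for the user to supply the necessary information before continuing, which prevents erroneous operations caused by missing key information.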
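3.2 Handling strategies
There are several strategies for handling an INFO action, selected through reply_mode:
- auto_reply: automatically calls a model to generate the reply
- no_reply: does not ask anyone for an answer; the agent is told to carry on with the task and may get stuck
- manual_reply: the reply is typed in manually on the server's console
- pass_to_client: passes the INFO action to the MCP client to handle
Where is reply_mode set?
- The handling mode is defined in the execute_task function
- The gui_agent_loop function runs the corresponding logic according to reply_mode
- The way INFO actions are handled can be adjusted dynamically
Details of the auto-reply mechanism:
- The auto_reply function combines the current task, the screenshot, and the content of the INFO action
- An LLM generates a suitable reply
- This reduces reliance on manual user input
Details of manual reply handling:
- In manual_reply mode, the program pauses and waits for user input
- Prompts in both English and Chinese help the user understand what needs to be answered
- The user's input is validated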
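The excerpt in section 3.6 reads the manual reply with a bare `input()` call, so the validation step is not visible there. As a minimal sketch of what such a check could look like, the hypothetical helper below keeps asking until it gets a non-empty reply. `ask_until_valid`, `read_line`, and `max_attempts` are illustrative names and are not taken from the project.

```python
def ask_until_valid(question, read_line=input, max_attempts=3):
    """Prompt for a reply to an INFO question until a non-empty one arrives.

    read_line is injectable so the helper can be exercised without a console;
    by default it is the built-in input().
    """
    prompt = f"Agent asks: {question}\nYour reply: "
    for _ in range(max_attempts):
        reply = read_line(prompt).strip()
        if reply:  # treat whitespace-only input as invalid
            return reply
    raise ValueError(f"no valid reply after {max_attempts} attempts")
```

Injecting `read_line` keeps the console dependency at the edge, which also makes it simple to swap the console for another input channel.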
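3.3 Flow control
The flow control for INFO works as follows:
Session interruption and resumption:
- When an INFO action fires in pass_to_client mode, stop_reason is set to INFO_ACTION_NEEDS_REPLY
- The current session state is saved, including session_id
- Execution can later be resumed with the same session_id
Reply passing:
- The user's reply is passed through the reply_from_client parameter
- It is sent to the agent as the query field of the payload
- The agent uses the user's reply as the input for its next step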
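From the client's side, the interrupt-and-resume cycle amounts to a loop: call, check stop_reason, collect a reply, continue with the same session_id. The sketch below illustrates this with stand-in functions so it can run on its own. `run_with_human`, `fake_start`, and `fake_continue` are hypothetical, and reading the question from `final_action["value"]` is an assumption about the returned log. Also note that the tool signatures quoted in section 2.4 have session_id and reply_from_client commented out, so the keyword names used here follow the earlier usage example and should be treated as assumptions too.

```python
def run_with_human(start, cont, device_id, task, ask_user, max_rounds=5):
    """Drive a task, handing every INFO question to ask_user."""
    result = start(device_id=device_id, task=task)
    for _ in range(max_rounds):
        if result["stop_reason"] != "INFO_ACTION_NEEDS_REPLY":
            return result
        # Assumption: the pending question travels in final_action["value"].
        question = result.get("final_action", {}).get("value", "")
        reply = ask_user(question)
        result = cont(
            device_id=device_id,
            task=None,                        # the task is already known to the session
            session_id=result["session_id"],  # resume the interrupted session
            reply_from_client=reply,
        )
    return result


# Stand-ins for the two MCP tools, used only to make the sketch runnable.
def fake_start(device_id, task):
    return {
        "stop_reason": "INFO_ACTION_NEEDS_REPLY",
        "session_id": "s1",
        "final_action": {"action_type": "INFO", "value": "Which gift?"},
    }


def fake_continue(device_id, task, session_id, reply_from_client):
    return {
        "stop_reason": "TASK_COMPLETED_SUCCESSFULLY",
        "session_id": session_id,
        "reply_seen": reply_from_client,
    }
```

The `max_rounds` cap is a design choice of this sketch: it keeps a client from looping forever if the agent keeps asking questions.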
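3.4 Implementation details of the INFO action
Information flows for an INFO action as follows:
From agent to user:
- The agent generates an INFO action that carries a value (the question)
- action['value'] is shown to the user
- The user types a reply
From user to agent:
- The user's input is passed through the reply_from_client parameter
- The reply_info variable stores the user's reply
- It is passed as the query field to the next automate_step call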
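To make the last hop concrete, here is a hedged sketch of how a stored reply might be attached to the next step's payload. Only the `query` field name comes from the description above. `build_step_payload` and the `session_id` and `image` keys are hypothetical and stand in for whatever the real payload contains.

```python
def build_step_payload(session_id, screenshot_b64, reply_info=None):
    """Assemble the request for the next agent step.

    reply_info is the user's answer to a pending INFO question, or None
    when no question is pending.
    """
    payload = {
        "session_id": session_id,   # hypothetical key
        "image": screenshot_b64,    # hypothetical key
    }
    if reply_info is not None:
        # The reply reaches the agent as the `query` field.
        payload["query"] = reply_info
    return payload
```

Omitting `query` when there is no reply, rather than sending an empty string, lets the receiving side tell "no pending answer" apart from "the user answered with nothing".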
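3.5 Use cases for the INFO action
Likely use cases for the INFO action include:
Human-agent collaboration
Verification codes:
- An INFO action fires when a CAPTCHA image or an SMS verification code appears
- The agent asks the user for the code
- The agent continues once the user has entered it
Confirming sensitive operations:
- Before a sensitive operation such as a payment or a deletion, the agent may ask the user to confirm through an INFO action
- This avoids unintended consequences from automated operations
Filling in missing information
Personal information:
- The agent needs the user's personal details, such as a name or an address
- It asks the user for the specific information through an INFO action
- It then completes tasks such as filling in a form
Decision support:
- When there are several options and the user has to choose
- The agent asks for the user's preference through an INFO action
- It continues along the path that matches the user's choice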
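Several of these scenarios depend on the agent being told up front when to stop and ask, which is what the usage guidance in the tool docstrings recommends for verification and payment steps. A minimal sketch of that idea follows. `GUIDANCE` and `with_guidance` are hypothetical, and the wording of the appended instruction is only an illustration.

```python
# Hypothetical instruction text appended to every task description.
GUIDANCE = (
    "If a human verification step (CAPTCHA or SMS code) appears, stop and ask the client. "
    "Stop before placing an order or paying, and let the user finish that step."
)


def with_guidance(task):
    """Return the task description with the safety guidance appended once."""
    task = task.strip()
    if GUIDANCE in task:
        return task  # already present, avoid duplicating it
    return f"{task}\n{GUIDANCE}"
```

A client could pass `with_guidance(task)` as the task argument, so that verification and payment steps tend to surface as INFO questions rather than being attempted automatically.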
3.6 Code
The INFO-related code is as follows:
    def gui_agent_loop(
        # code omitted
    ):
        """
        Evaluate a task on a device using the provided frontend action converter and action function.
        """
        # code omitted (including the loop header)

            action = uiTars_to_frontend_action(action)

            if action['action_type'].upper() == "INFO":
                if reply_mode == "auto_reply":
                    print(f"AUTO REPLY INFO FROM MODEL!")
                    reply_info = auto_reply(
                        image_b64_url,
                        task,
                        action,
                        model_provider=agent_loop_config['model_config']['model_provider'],
                        model_name=agent_loop_config['model_config']['model_name'],
                    )
                    print(f"info: {reply_info}")

                elif reply_mode == "no_reply":
                    print(f"INFO action ignored as per reply_mode=no_reply. Agent may get stuck.")
                    reply_info = "Please follow the task and continue. Don't ask further questions."
                    # do nothing, agent may get stuck

                elif reply_mode == "manual_reply":
                    print(f"EN: Agent asks: {action['value']} Please Reply: ")
                    print(f"ZH: Agent 问你: {action['value']} 回复一下:")
                    reply_info = input("Your reply:")
                    print(f"Replied info action: {reply_info}")

                elif reply_mode == "pass_to_client":
                    print(f"Passing INFO action to client for reply.")
                    # break the loop and return to client for handling
                    stop_reason = "INFO_ACTION_NEEDS_REPLY"
                    break

                else:
                    raise ValueError(f"Unknown reply_mode: {reply_mode}")

            # code omitted

            act_on_device(action, device_id, device_wm_size, print_command=True, reflush_app=reflush_app)
            history_actions.append(action)

        # code omitted

        if stop_reason in ['MANUAL_STOP_SCREEN_OFF', 'INFO_ACTION_NEEDS_REPLY', "NOT_STARTED"]:
            pass
        elif action['action_type'].upper() == 'COMPLETE':
            stop_reason = "TASK_COMPLETED_SUCCESSFULLY"
        elif action['action_type'].upper() == 'ABORT':
            stop_reason = "TASK_ABORTED_BY_AGENT"
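The final branch of the excerpt decides the stop_reason that the client sees. As a hedged restatement, the sketch below isolates that decision in a standalone function so it can be checked without a device. `resolve_stop_reason` and `PRESERVED` are hypothetical names, and returning the reason unchanged when the last action is neither COMPLETE nor ABORT is an assumption, because the excerpt does not show that case.

```python
# Stop reasons that were set earlier in the loop and must not be overwritten.
PRESERVED = {"MANUAL_STOP_SCREEN_OFF", "INFO_ACTION_NEEDS_REPLY", "NOT_STARTED"}


def resolve_stop_reason(stop_reason, last_action):
    """Map the loop's state and last action to the final stop_reason."""
    if stop_reason in PRESERVED:
        return stop_reason
    action_type = last_action["action_type"].upper()
    if action_type == "COMPLETE":
        return "TASK_COMPLETED_SUCCESSFULLY"
    if action_type == "ABORT":
        return "TASK_ABORTED_BY_AGENT"
    # Assumption: any other case keeps whatever reason was already set.
    return stop_reason
```

The ordering matters: because the preserved reasons are checked first, an INFO interruption is never relabeled as a completed or aborted task.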