Ansible Automation Tool

Introduction

Ansible is an open-source automation platform that enables infrastructure as code, configuration management, application deployment, and orchestration across diverse IT environments. As an agentless automation tool, Ansible uses SSH for Linux/Unix systems and WinRM for Windows systems to execute tasks remotely without requiring software installation on managed nodes.

Core Automation Capabilities

Ansible addresses four primary automation domains within enterprise IT operations:

Business Value and Operational Impact

Organizations implement Ansible to achieve measurable improvements in operational efficiency and reliability:

Technical Architecture Overview

Ansible operates through a control node that executes automation against managed nodes using declarative language constructs. The architecture consists of:

Role-Based Responsibilities

Tier 1 Support: Execute pre-approved playbooks, monitor automation job status, and escalate failures following documented procedures.

Tier 2 Support: Troubleshoot playbook failures, modify existing automation, create simple playbooks, and manage inventory updates.

Tier 3 Support: Design complex automation workflows, develop custom modules, implement security policies, and architect enterprise Ansible deployments.

Training Objectives

This training enables technical staff to effectively operate, troubleshoot, and extend Ansible automation within enterprise environments. Upon completion, participants will demonstrate competency in playbook execution, basic troubleshooting, and escalation procedures appropriate to their support tier.

Audience & Scope

Primary Audience

This training is designed for IT operations professionals who need to understand, deploy, or troubleshoot Ansible automation in enterprise environments. The content assumes basic Linux command-line proficiency and fundamental networking concepts.

Prerequisites

Participants should have:

Training Scope

This training covers operational deployment and management of Ansible in production environments. The focus is on practical implementation rather than development of complex automation logic.

Included Topics

Excluded Topics

Role-Based Learning Paths

Tier 1 Support Focus

Tier 1 engineers will learn to:

Tier 2/3 Operations Focus

Senior operations staff will learn to:

Expected Outcomes

Upon completion, participants will be able to:

  1. Deploy Ansible in a production environment following security best practices
  2. Configure inventory management for dynamic and static infrastructure
  3. Execute and monitor automation workflows effectively
  4. Troubleshoot common operational issues using systematic approaches
  5. Implement appropriate escalation procedures when issues exceed their skill level
  6. Integrate Ansible operations with existing enterprise toolchains

Training Environment Requirements

Hands-on exercises require:

Role-Based Responsibilities (Tier 1 / Tier 2 / Tier 3 boundaries)

Tier 1 Support Responsibilities

Tier 1 support handles initial incident response and basic operational tasks that require minimal Ansible expertise.

Monitoring and Basic Troubleshooting

Information Gathering

Basic Remediation Actions

Escalation Triggers for Tier 1

Tier 2 Support Responsibilities

Tier 2 support handles complex troubleshooting, playbook analysis, and configuration modifications requiring intermediate Ansible knowledge.

Advanced Troubleshooting

Configuration Management

Performance Optimization

Escalation Triggers for Tier 2

Tier 3 Support Responsibilities

Tier 3 support handles expert-level issues, architecture decisions, and strategic automation development requiring deep Ansible expertise.

Architecture and Design

Advanced Development

Strategic Planning

Cross-Tier Communication Requirements

Escalation Information Package

When escalating between tiers, always include:

Knowledge Transfer Expectations

Learning Path (progressive modules)

This learning path provides a structured progression through Ansible concepts and skills, designed for technical professionals moving from basic automation tasks to advanced enterprise implementations.

Module 1: Foundation Concepts

Objective: Establish core understanding of Ansible architecture and terminology

Prerequisites: Basic Linux command line knowledge, SSH familiarity

Duration: 8-12 hours

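Validation Exercise: Create a simple inventory file with 3 test servers and verify connectivity to all hosts with ansible all -i inventory -m ping. Run ansible --version separately to confirm the installation on the control node; that command does not contact managed hosts.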

Module 2: Ad-Hoc Commands and Basic Operations

Objective: Execute immediate tasks without playbooks

Prerequisites: Module 1 completion

Duration: 6-8 hours

Decision Prompt: You need to check disk space on 50 servers immediately. What would you do?

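Answer: Use an ad-hoc command: ansible all -m command -a "df -h". The command module is sufficient here; the shell module is only needed when the command relies on pipes, redirection, or other shell features.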

Module 3: Playbook Development Fundamentals

Objective: Create structured, repeatable automation scripts

Prerequisites: Module 2 completion

Duration: 12-16 hours

Scenario Example: Create a playbook that installs Apache, starts the service, and deploys a custom index.html file only on web servers in the inventory.

Common Mistake: Forgetting to use become: yes for tasks requiring root privileges. Always validate privilege requirements before execution.

Module 4: Advanced Playbook Features

Objective: Implement complex logic and error handling

Prerequisites: Module 3 completion

Duration: 10-14 hours

Validation Exercise: Build a playbook with error handling that attempts to start a service, captures failure, and sends notification on error.
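
One possible shape for this exercise is sketched below using block, rescue, and always. It assumes a systemd host running httpd and a webservers inventory group; the debug task is only a placeholder for a real notification step such as mail or chat integration.

```yaml
---
- name: Start a service with error handling
  hosts: webservers
  become: yes
  tasks:
    - name: Attempt to start the service
      block:
        - name: Start Apache
          service:
            name: httpd
            state: started
      rescue:
        # Runs only if a task in the block failed
        - name: Capture recent service log output
          command: journalctl -u httpd -n 20 --no-pager
          register: service_log
          changed_when: false
          failed_when: false

        - name: Send notification (placeholder for a real notifier)
          debug:
            msg: "httpd failed to start on {{ inventory_hostname }}: {{ ansible_failed_result.msg | default('no message') }}"
      always:
        # Runs whether the block succeeded or failed
        - name: Record that the attempt finished
          debug:
            msg: "Service start attempt finished on {{ inventory_hostname }}"
```

Because the rescue section completes successfully, the host is reported as rescued rather than failed in the play recap. Add a final fail task to the rescue section if the run should still end in failure.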

Module 5: Inventory Management and Variables

Objective: Organize infrastructure and manage configuration data

Prerequisites: Module 4 completion

Duration: 8-10 hours

Decision Prompt: You have database passwords that need to be used in playbooks but kept secure. What approach would you use?

Answer: Use Ansible Vault to encrypt sensitive variables in separate files, referenced in playbooks.
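
A common way to apply this answer is to keep a plain-text variable file that points at a vault-encrypted one. The sketch below assumes a dbservers group and uses hypothetical variable names (db_user, db_password, vault_db_password).

```yaml
# group_vars/dbservers/vars.yml (plain text, reviewable in version control)
db_user: app_user                        # hypothetical account name
db_password: "{{ vault_db_password }}"   # indirection to the encrypted value

# group_vars/dbservers/vault.yml (encrypted at rest with ansible-vault)
# Once decrypted for editing, it would contain a line such as:
# vault_db_password: "replace-me"
```

Playbooks then reference db_password as usual, and the run is started with --ask-vault-pass or a vault password file. Tasks that handle the secret can set no_log: true to keep it out of job output.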

Module 6: Roles and Content Organization

Objective: Structure reusable automation components

Prerequisites: Module 5 completion

Duration: 12-16 hours

Scenario Example: Convert an existing playbook into a reusable role that can be shared across multiple projects with different variable inputs.
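
As a rough illustration of "different variable inputs", the sketch below applies a hypothetical apache role and overrides one of its defaults. The role name and the apache_port variable are assumptions, not part of any existing project.

```yaml
# roles/apache/defaults/main.yml would hold overridable defaults, e.g.:
#   apache_port: 80
---
- name: Apply the same role with project-specific inputs
  hosts: webservers
  become: yes
  roles:
    - role: apache
      vars:
        apache_port: 8080   # overrides the role default for this play only
```

Keeping every tunable value in defaults/main.yml is what lets another project reuse the role unchanged and supply its own values.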

Module 7: Enterprise Integration

Objective: Implement Ansible in production environments

Prerequisites: Module 6 completion

Duration: 14-18 hours

Role-Based Learning Tracks

Tier 1 Support Track: Modules 1-3, focus on executing existing playbooks and basic troubleshooting

Tier 2 Administrator Track: Modules 1-6, emphasis on playbook development and role creation

Tier 3 Architect Track: All modules, including enterprise integration and advanced optimization techniques

Escalation Triggers During Learning

Expected Completion Timeline: 8-12 weeks for full track completion with hands-on practice between modules.

Hands-On Labs (scenario-based)

Lab 1: Web Server Configuration

Objective: Deploy and configure Apache web servers across multiple hosts using Ansible playbooks.

Prerequisites:

Scenario: Your organization needs to deploy Apache web servers on three CentOS hosts with custom index pages and firewall rules.

Step-by-step Instructions:

  1. Create inventory file with target hosts:
    [webservers]
    web1.example.com
    web2.example.com
    web3.example.com
  2. Write playbook to install Apache:
    ---
    - name: Configure web servers
      hosts: webservers
      become: yes
      tasks:
        - name: Install Apache
          yum:
            name: httpd
            state: present
        
        - name: Start and enable Apache
          systemd:
            name: httpd
            state: started
            enabled: yes
  3. Add firewall configuration task
  4. Create custom index.html template
  5. Execute playbook with verbose output
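
Steps 3 and 4 above could be sketched as the following tasks, appended to the tasks list of the playbook from step 2. This assumes firewalld is running on the hosts, the ansible.posix collection is installed on the control node, and a hypothetical template file templates/index.html.j2 sits next to the playbook.

```yaml
- name: Allow HTTP through the firewall
  ansible.posix.firewalld:
    service: http
    permanent: yes
    immediate: yes
    state: enabled

- name: Deploy custom index page from a template
  template:
    src: index.html.j2
    dest: /var/www/html/index.html
    owner: root
    group: root
    mode: "0644"
```

For step 5, adding -v (or up to -vvv) to the ansible-playbook command increases the detail shown for each task.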

Expected Result: Apache running on all three hosts with custom content accessible via HTTP.

Validation Steps:

What would you do? If one host fails during playbook execution, how would you troubleshoot and retry only the failed host?

Answer: Re-run with the --limit flag to target only the failed host (for example, --limit web2.example.com) and add -vvv for detailed error output. Check SSH connectivity and sudo permissions first.

Lab 2: Database Server Deployment

Objective: Deploy MySQL database servers with security hardening and user management.

Prerequisites:

Scenario: Deploy MySQL on database servers with encrypted root passwords, create application databases, and configure backup users.

Step-by-step Instructions:

  1. Create encrypted vault file for sensitive data:
    ansible-vault create group_vars/dbservers/vault.yml
  2. Define database configuration variables
  3. Write playbook using mysql_user and mysql_db modules
  4. Include security hardening tasks (remove test databases, anonymous users)
  5. Execute with vault password prompt
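
A minimal sketch of steps 3 and 4 follows. It assumes the community.mysql collection is installed, PyMySQL is available on the database hosts, root can authenticate through the local socket at the path shown, and the variables app_db_name and vault_backup_password are defined in group_vars (both names are hypothetical).

```yaml
---
- name: Harden MySQL and create application database
  hosts: dbservers
  become: yes
  vars:
    mysql_socket: /var/lib/mysql/mysql.sock   # typical on CentOS; adjust as needed
  tasks:
    - name: Remove anonymous users
      community.mysql.mysql_user:
        name: ""
        host_all: yes
        state: absent
        login_unix_socket: "{{ mysql_socket }}"

    - name: Remove the test database
      community.mysql.mysql_db:
        name: test
        state: absent
        login_unix_socket: "{{ mysql_socket }}"

    - name: Create the application database
      community.mysql.mysql_db:
        name: "{{ app_db_name }}"
        state: present
        login_unix_socket: "{{ mysql_socket }}"

    - name: Create a backup user with read-only privileges
      community.mysql.mysql_user:
        name: backup
        password: "{{ vault_backup_password }}"
        priv: "*.*:SELECT,LOCK TABLES,SHOW VIEW"
        state: present
        login_unix_socket: "{{ mysql_socket }}"
      no_log: true   # keep the password out of job output
```

Running the playbook with --ask-vault-pass covers step 5.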

Expected Result: Secure MySQL installation with application databases and restricted user access.

Validation Steps:

Common Mistakes:

Lab 3: Application Deployment Pipeline

Objective: Create end-to-end application deployment using roles and handlers.

Scenario: Deploy a Python web application with Nginx reverse proxy, including SSL certificates and monitoring configuration.

Step-by-step Instructions:

  1. Structure deployment using Ansible roles:
    roles/
    ├── common/
    ├── nginx/
    ├── python-app/
    └── monitoring/
  2. Configure role dependencies and variables
  3. Implement handlers for service restarts
  4. Use templates for configuration files
  5. Test deployment in staging environment first
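
Steps 3 and 4 usually work together: a template task notifies a handler, and the handler runs once at the end of the play only if the rendered file changed. The sketch below assumes the nginx role from step 1 and a hypothetical template named app.conf.j2.

```yaml
# roles/nginx/tasks/main.yml
- name: Render the reverse proxy configuration
  template:
    src: app.conf.j2
    dest: /etc/nginx/conf.d/app.conf
    mode: "0644"
  notify: Reload nginx

# roles/nginx/handlers/main.yml
- name: Reload nginx
  service:
    name: nginx
    state: reloaded
```

The notify value must match the handler name exactly; a mismatch is a common reason a service never picks up new configuration.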

What would you do? During deployment, the application fails to start due to a configuration error. How would you roll back and investigate?

Answer: Use tags to run only rollback tasks, check application logs, and validate configuration syntax before redeployment. Implement health checks in playbook.

Lab 4: Infrastructure Scaling

Objective: Dynamically scale infrastructure based on load requirements using dynamic inventory.

Scenario: Scale web tier by adding new instances and updating load balancer configuration automatically.

Step-by-step Instructions:

  1. Configure dynamic inventory for cloud provider
  2. Create playbook to provision new instances
  3. Update load balancer pool with new hosts
  4. Verify health checks pass before adding to rotation
  5. Implement graceful rollback if scaling fails
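
The lab does not name a cloud provider. Assuming AWS purely for illustration, step 1 could look like the inventory plugin configuration below. It requires the amazon.aws collection and boto3 on the control node; the region and the Role and Environment tag names are hypothetical.

```yaml
# inventory/web.aws_ec2.yml (the file name must end in aws_ec2.yml or aws_ec2.yaml)
plugin: amazon.aws.aws_ec2
regions:
  - us-east-1
filters:
  instance-state-name: running
  "tag:Role": web
keyed_groups:
  # Builds groups such as env_production from the Environment tag
  - key: tags.Environment
    prefix: env
```

Running ansible-inventory -i inventory/web.aws_ec2.yml --graph shows the groups the plugin builds, which is a quick way to confirm new instances are picked up before updating the load balancer.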

Expected Result: Additional capacity available with automated load balancer updates.

Tier 1 Responsibilities:

Escalation Triggers:

Tier 2/3 Responsibilities:

Lab 5: Disaster Recovery Scenario

Objective: Execute disaster recovery procedures using Ansible automation.

Scenario: Primary data center is unavailable. Restore services in secondary location using backup configurations and data.

Step-by-step Instructions:

  1. Activate disaster recovery inventory
  2. Restore database from automated backups
  3. Deploy application stack in recovery site
  4. Update DNS and load balancer configurations
  5. Validate all services are operational

Critical Validation Points:

What would you do? If database restoration fails due to corruption, what immediate actions should you take?

Answer: Immediately escalate to Tier 2, attempt restoration from previous backup point, document failure details, and activate manual procedures if available.

Decision Checkpoints ("What would you do?" with answers)

Scenario 1: Playbook Execution Fails on Multiple Hosts

Situation: You execute an Ansible playbook against 20 servers, and it fails on 8 of them with various error messages including "SSH connection timeout," "Permission denied," and "Module not found."

What would you do?

Correct Answer:

  1. Check the ansible output for specific error patterns
  2. Verify SSH connectivity to the failed hosts only, using ansible all -m ping --limit followed by a comma-separated list of those hosts
  3. Review inventory file for correct hostnames/IPs
  4. Validate SSH keys and user permissions on target hosts
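  5. Check that required modules or collections are installed on the control node, and that any Python libraries they depend on exist on the target hosts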
  6. Re-run playbook with increased verbosity using -vvv flag

Reasoning: Multiple failure types suggest infrastructure or configuration issues rather than playbook logic problems. Systematic verification of connectivity and permissions addresses the most common failure causes.

Common Mistake: Immediately modifying the playbook code without first verifying basic connectivity and authentication.

Scenario 2: Playbook Runs Successfully But Changes Aren't Applied

Situation: Your Ansible playbook completes with "ok" status on all tasks, but when you check the target servers, the expected configuration changes are not present.

What would you do?

Correct Answer:

  1. Review the playbook output for "changed" vs "ok" status indicators
  2. Check if tasks are using check_mode or dry-run parameters
  3. Verify task conditions and when clauses aren't preventing execution
  4. Examine variable values using debug tasks
  5. Confirm you're targeting the correct hosts in your inventory
  6. Run with --check --diff to preview what changes would be made (--diff on a normal run shows the changes actually applied)

Reasoning: "OK" status typically means Ansible detected the desired state already exists, or tasks were skipped due to conditions. This requires investigating why changes weren't applied rather than assuming failure.

Common Mistake: Assuming the playbook is broken when it may be working correctly but conditions prevent changes.

Scenario 3: Inventory Host Groups Not Responding as Expected

Situation: You run a playbook targeting the "webservers" group, but it executes against database servers instead, or some expected web servers are missing from the execution.

What would you do?

Tier 1 Actions:

  1. Verify inventory file syntax and group definitions
  2. Use ansible-inventory --list to see how groups are resolved
  3. Check for duplicate hostnames in different groups
  4. Confirm you're using the correct inventory file with -i parameter

Escalate to Tier 2 if: Inventory structure requires reorganization or dynamic inventory sources need configuration.

Reasoning: Incorrect host targeting usually stems from inventory configuration issues that can be diagnosed through Ansible's built-in inventory tools.

Scenario 4: Task Hangs Without Completing

Situation: An Ansible task starts executing but appears to hang indefinitely without completing or failing. The playbook shows the task as "running" for over 30 minutes.

What would you do?

Immediate Actions:

  1. Check if the task involves long-running operations (package installations, file transfers)
  2. Verify network connectivity to target hosts hasn't been interrupted
  3. Review task for missing timeout parameters
  4. Check target host resources (CPU, memory, disk space)
  5. Examine target host processes to see if the task is actually running

Escalation Trigger: If task involves custom modules or complex operations requiring code analysis.

Reasoning: Hanging tasks often indicate resource constraints, network issues, or missing timeout configurations rather than Ansible bugs.
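
Two ways to bound task run time are sketched below: the timeout task keyword (available in Ansible 2.10 and later) and async with poll. The script path is hypothetical, and the limits are example values rather than recommendations.

```yaml
- name: Install updates with a hard time limit
  yum:
    name: "*"
    state: latest
  timeout: 1800        # fail the task if it runs longer than 30 minutes

- name: Run a long job without holding the connection open
  command: /usr/local/bin/long_job.sh   # hypothetical script
  async: 3600          # allow up to one hour in total
  poll: 30             # check job status every 30 seconds
```

With either setting, a stuck task eventually fails with a clear error instead of showing as running indefinitely.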

Scenario 5: Variable Values Not Resolving Correctly

Situation: Your playbook uses variables, but when executed, you see literal variable names (like "{{ app_version }}") in configuration files instead of the expected values.

What would you do?

Correct Answer:

  1. Check variable definition locations (group_vars, host_vars, playbook vars)
  2. Verify variable naming conventions and spelling
  3. Use debug tasks to print variable values before using them
  4. Check for proper YAML syntax in variable files
  5. Verify variable precedence isn't causing overrides
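  6. Ensure templates use proper Jinja2 syntax and are deployed with the template module; the copy module transfers files as-is and leaves "{{ ... }}" expressions unrendered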

Common Mistake: Assuming variables are undefined when they may be defined but not accessible due to scope or precedence issues.
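
A short diagnostic sketch for this scenario: print the resolved value first, then confirm the file is rendered by the template module. The file paths are hypothetical.

```yaml
- name: Show the value Ansible resolves for this host
  debug:
    var: app_version

# The copy module would transfer the file unchanged, leaving "{{ app_version }}"
# literal; the template module renders Jinja2 expressions before writing the file.
- name: Render configuration with variables substituted
  template:
    src: app.conf.j2
    dest: /etc/myapp/app.conf
    mode: "0644"
```

If the debug task reports the variable as undefined, the problem lies in where the variable is defined; if it prints the right value but the file still shows the placeholder, the file is probably not being deployed through template.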

Scenario 6: Role Dependencies Causing Conflicts

Situation: After adding a new role to your playbook, existing roles begin failing with errors about conflicting handlers or duplicate task names.

Tier 1 Assessment:

Escalate to Tier 2 for: Role refactoring, dependency resolution, or architectural changes to eliminate conflicts.

Reasoning: Role conflicts typically require structural changes that go beyond basic troubleshooting and may impact multiple playbooks.

Definition of Done (clear completion criteria)

Establishing clear completion criteria ensures Ansible automation tasks are properly validated and meet operational standards before being considered complete.

Playbook Execution Completion

Objective: Verify playbook has executed successfully without errors or unexpected failures.

Success Criteria:

Validation Steps:

  1. Review final play recap for zero failures
  2. Check for any tasks marked as "ignored" and verify that each one was intentionally ignored
  3. Confirm all conditional tasks executed as expected
  4. Validate no connection timeouts or authentication failures

Configuration State Verification

Objective: Confirm target systems are in the desired configuration state.

Success Criteria:

Validation Commands:

# Service status verification
ansible all -m service -a "name=httpd state=started" --check

# File content verification  
ansible all -m command -a "grep 'expected_value' /path/to/config"

# Port connectivity check
ansible all -m wait_for -a "port=80 timeout=10"

Idempotency Confirmation

Objective: Ensure playbook can be run multiple times without unintended changes.

Success Criteria:

Testing Process:

  1. Execute playbook in check mode: ansible-playbook playbook.yml --check
  2. Run playbook normally
  3. Execute again and verify no changes reported
  4. Compare system state before and after second run

Documentation and Compliance

Objective: Ensure proper documentation and adherence to organizational standards.

Success Criteria:

Role-Based Completion Authority

Tier 1 Authority:

Requires Tier 2/3 Approval:

Escalation Triggers

Escalate When:

Training Scenario: Completion Assessment

Scenario: Your playbook executed with the following recap:

PLAY RECAP *****************************
web01: ok=5 changed=2 unreachable=0 failed=0
web02: ok=5 changed=0 unreachable=0 failed=0
db01: ok=3 changed=1 unreachable=0 failed=0

Decision Point: Can this be marked as complete?

Correct Assessment: Potentially complete, but requires validation. The different "changed" counts between web01 and web02 need investigation. Verify why web01 had changes while web02 did not - this could indicate configuration drift or a legitimate difference in initial state.

Common Mistake: Marking complete based solely on "failed=0" without investigating why identical systems show different change counts.

Conceptual Model / Mental Model

The Ansible Paradigm

Ansible operates on a fundamentally different paradigm than traditional scripting or configuration management tools. Think of Ansible as a declarative language where you describe the desired end state rather than the specific steps to achieve it. This shift from "how to do something" to "what the final result should look like" is critical for understanding Ansible's power and limitations.

Core Mental Framework

Visualize Ansible as having three primary layers:

The control layer never installs agents on target systems. Instead, it pushes temporary Python modules over SSH, executes them, and removes them. This "agentless" model means targets only need SSH access and Python - no persistent Ansible processes run on managed nodes.

Idempotency as a Core Principle

Idempotency means running the same Ansible task multiple times produces the same result without unwanted side effects. A properly written Ansible task checks current state before making changes. If the system is already in the desired state, no action occurs. If changes are needed, Ansible applies only what's necessary to reach the target state.

Example mental model: Think of idempotency like a thermostat. You set it to 72°F. If the room is already 72°F, nothing happens. If it's 68°F, heat turns on until it reaches 72°F. Running the "set to 72°F" command repeatedly won't overheat the room.
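
The contrast can be seen in two tasks that aim for the same result. This is an illustrative sketch; the configuration path and setting are example values only.

```yaml
# Not idempotent: appends the line again on every run
- name: Add setting (repeats on each run)
  shell: echo "MaxClients 200" >> /etc/httpd/conf/httpd.conf

# Idempotent: changes the file only if the line is missing or different
- name: Ensure setting is present
  lineinfile:
    path: /etc/httpd/conf/httpd.conf
    regexp: '^MaxClients\s'
    line: MaxClients 200
```

The first task reports "changed" every time; the second reports "changed" once and "ok" on later runs, which is the behavior the thermostat analogy describes.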

Inventory as Your System Map

The inventory is Ansible's map of your infrastructure. It defines not just which systems exist, but how they're grouped and what variables apply to each. Think of inventory as creating logical relationships between physical or virtual resources. A single server might belong to multiple groups simultaneously (webservers, production, east-coast) and inherit variables from each group.
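
A small YAML inventory, sketched with example hostnames, shows one server belonging to several groups at once. Underscores are used in group names because hyphens trigger warnings in current Ansible releases.

```yaml
all:
  children:
    webservers:
      hosts:
        web1.example.com:
        web2.example.com:
      vars:
        http_port: 80          # applies to every host in webservers
    production:
      hosts:
        web1.example.com:
    east_coast:
      hosts:
        web1.example.com:
```

Here web1.example.com inherits variables from webservers, production, and east_coast, while web2.example.com inherits only from webservers.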

Tasks, Plays, and Playbooks Hierarchy

Understanding the hierarchy is essential:

Mental model: Think of a playbook as a recipe book, plays as individual recipes, and tasks as recipe steps. Each recipe (play) might serve different groups of people (host groups) but uses the same basic ingredients and techniques (modules).

Module System Philosophy

Modules are Ansible's building blocks - discrete units of functionality that handle specific operations. Each module is designed to be idempotent and handle error conditions gracefully. Modules abstract the complexity of different operating systems, package managers, and service managers behind consistent interfaces.

Key insight: You don't call system commands directly in well-designed Ansible automation. Instead, you use modules that understand the underlying system differences and handle edge cases appropriately.

State Management vs. Procedural Execution

Traditional scripts execute commands in sequence. Ansible evaluates desired state and determines necessary actions. This distinction affects how you approach problem-solving:

Error Handling and Recovery Model

Ansible's default behavior is to stop execution on a host when a task fails, but continue on other hosts. This "fail fast" approach prevents cascading errors while maintaining parallel execution benefits. Understanding this behavior is crucial for designing robust automation that handles partial failures gracefully.
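
The default can be tuned per play or per task. The sketch below uses standard play keywords with example values; the cleanup script path is hypothetical.

```yaml
- name: Rolling update with failure limits
  hosts: webservers
  serial: 5                  # work through hosts in batches of five
  max_fail_percentage: 20    # abort the play if more than 20% of a batch fails
  tasks:
    - name: Optional cleanup that should not stop the host
      command: /usr/local/bin/cleanup.sh   # hypothetical script
      ignore_errors: yes
```

Setting any_errors_fatal: true at the play level is the stricter alternative: a failure on any single host stops the play for all hosts.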

Variable Precedence and Scope

Variables in Ansible follow a complex precedence hierarchy. Think of variables as having different "weights" - command-line variables override playbook variables, which override inventory variables, which override role defaults. Understanding this hierarchy prevents confusion when the same variable name appears in multiple locations with different values.
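
The same variable defined at four levels illustrates the ordering described above. File names and values are hypothetical; only the first definition is left uncommented so the block remains a single valid file.

```yaml
# roles/app/defaults/main.yml    (lowest precedence)
app_version: "1.0"

# group_vars/webservers.yml      (inventory variable, overrides the role default)
# app_version: "1.1"

# vars: section of the play      (overrides the inventory variable)
# app_version: "1.2"

# Command line, highest precedence:
#   ansible-playbook site.yml -e app_version=1.3
```

With all four in place, the run uses 1.3; removing the -e option would yield 1.2, and so on down the list.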

Push vs. Pull Architecture Implications

Ansible's push model means the control node initiates all actions. This differs from pull-based systems where agents periodically check for updates. The push model provides immediate execution and centralized control but requires the control node to reach all targets. Network connectivity, authentication, and timing all flow from control node to targets, never the reverse.

Architecture & Components

Control Node Architecture

The Ansible control node serves as the central management point where Ansible is installed and executed. This node contains the Ansible engine, inventory files, playbooks, and configuration files. The control node communicates with managed nodes via SSH (Linux/Unix) or WinRM (Windows) without requiring agent installation on target systems.

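Key control node requirements include a supported Python 3 interpreter and SSH connectivity to managed nodes. Releases up to ansible-core 2.11 also ran on Python 2.7 or Python 3.5+, but later releases require a newer Python 3, so check the requirements for the version you deploy. The control node can be a physical server, virtual machine, or containerized environment depending on organizational needs.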

Managed Node Components

Managed nodes are target systems that Ansible configures and manages. Linux/Unix nodes require minimal prerequisites: SSH service running, Python interpreter available, and network connectivity to the control node. Windows nodes are reached over WinRM and run PowerShell-based modules instead of Python. Managed nodes do not require Ansible installation, making the architecture lightweight and scalable.

When Ansible executes tasks, it copies Python modules to managed nodes temporarily, executes them, and removes them upon completion. This agentless approach reduces maintenance overhead and security surface area.

Core Engine Components

The Ansible engine consists of several interconnected components:

Plugin Architecture

Ansible's plugin system extends core functionality through modular components:

Communication Flow

Ansible follows a push-based architecture where the control node initiates all communication:

  1. Control node reads inventory and playbook files
  2. Establishes connections to managed nodes via SSH/WinRM
  3. Copies required Python modules to temporary directories
  4. Executes modules on managed nodes
  5. Collects results and removes temporary files
  6. Processes output through callback plugins

Security Architecture

Ansible implements security through existing infrastructure components rather than introducing new authentication mechanisms. SSH key-based authentication provides secure, passwordless access to managed nodes. All communication occurs over encrypted channels using standard protocols.

The agentless design eliminates persistent processes on managed nodes, reducing attack surface. Privilege escalation uses existing mechanisms like sudo, su, or runas, maintaining consistency with organizational security policies.

Scalability Components

Ansible's architecture supports horizontal scaling through several mechanisms:

Integration Architecture

Ansible integrates with external systems through multiple touchpoints:

Role-Based Component Access

Tier 1 Responsibilities: Monitor control node status, verify SSH connectivity, check basic inventory accessibility, and validate playbook syntax using ansible-playbook --syntax-check.

Escalation Required: Control node configuration changes, plugin installation or modification, SSH key management, and inventory source modifications require Tier 2 involvement.

Tier 2/3 Responsibilities: Architecture design decisions, plugin development, security configuration, and integration with external systems.

Access, Authentication & Roles

Authentication Methods

Ansible supports multiple authentication mechanisms for connecting to managed nodes and accessing control systems. The primary methods include SSH key-based authentication, password authentication, and integration with external authentication systems.

SSH key-based authentication is the recommended approach for Linux/Unix systems. Ansible uses the control node's SSH client to establish connections, leveraging existing SSH configurations and key pairs. Password authentication serves as a fallback option but requires additional security considerations in production environments.

For Windows systems, Ansible utilizes WinRM (Windows Remote Management) with support for basic authentication, certificate-based authentication, and Kerberos integration for domain environments.

Access Control Framework

Ansible Tower and AWX provide comprehensive role-based access control (RBAC) systems that govern user permissions and resource access. The framework operates on three core components: users, teams, and roles.

Users represent individual accounts with specific credentials and permissions. Teams group users with similar responsibilities or organizational functions. Roles define permission sets that can be assigned to users or teams for specific resources.

Resource-level permissions control access to inventories, projects, job templates, credentials, and organizations. Permissions cascade through the organizational hierarchy, allowing administrators to implement granular access controls.

Role Types and Permissions

The system defines several built-in role types with predefined permission sets:

Credential Management

Ansible Tower stores and manages credentials securely using encryption at rest. Credential types include machine credentials for SSH access, cloud credentials for dynamic inventory, source control credentials for project synchronization, and vault credentials for encrypted variable files.

Machine credentials contain SSH private keys, usernames, passwords, and privilege escalation settings. These credentials can be associated with specific inventories or job templates to automate authentication during playbook execution.

Cloud credentials enable dynamic inventory synchronization and resource provisioning across various cloud platforms. Each cloud provider requires specific credential formats and permission scopes.

Authentication Configuration Tasks

Objective: Configure SSH key-based authentication for Ansible managed nodes

Prerequisites: Administrative access to control node, target node credentials, SSH client tools

Steps:

  1. Generate SSH key pair on control node: ssh-keygen -t rsa -b 4096 -f ~/.ssh/ansible_key
  2. Copy public key to target nodes: ssh-copy-id -i ~/.ssh/ansible_key.pub user@target_host
  3. Test connectivity: ssh -i ~/.ssh/ansible_key user@target_host
  4. Configure ansible.cfg with the private key path in the [defaults] section: private_key_file = ~/.ssh/ansible_key
  5. Verify Ansible connectivity: ansible target_host -m ping

Expected Result: Successful SSH connection without password prompts and successful Ansible ping response

Validation: Execute ansible-inventory --list to confirm the host is present in inventory, then ansible target_host -m setup to confirm authentication and fact gathering

Role Assignment Procedures

Objective: Assign appropriate roles to users for specific resources in Ansible Tower

Prerequisites: System Administrator or Organization Admin permissions, existing user accounts, defined resources

Steps:

  1. Navigate to Access Management section in Tower interface
  2. Select Users and locate target user account
  3. Click Permissions tab for the selected user
  4. Click Add button to assign new permissions
  5. Select resource type (Organization, Project, Inventory, Job Template)
  6. Choose specific resource instance from dropdown
  7. Select appropriate role type based on user requirements
  8. Save permission assignment
  9. Verify assignment appears in user's permission list

Expected Result: User can access assigned resources with specified permission level

Validation: Log in as target user and verify access to assigned resources matches role permissions

Common Authentication Scenarios

Scenario: A new team member needs access to execute existing playbooks for web server maintenance but should not modify configurations.

What would you do? Assign Execute role for specific job templates related to web server maintenance, avoiding Admin or Modify permissions.

Reasoning: Execute permissions allow job template execution while preventing unauthorized modifications to critical automation workflows.

Scenario: SSH authentication fails with "Permission denied (publickey)" error when running playbooks.

What would you do? Verify SSH key permissions (600 for private key), confirm public key installation on target hosts, and check SSH agent configuration.

Reasoning: SSH key authentication requires proper file permissions and key distribution to function correctly.

Tier Responsibilities

Tier 1 Responsibilities:

Escalation Required:

Tier 2/3 Responsibilities:

Common Authentication Mistakes

Avoid using shared credentials across multiple users or systems. Each user should have individual authentication credentials for proper audit trails and access control.

Do not store passwords in plain text within playbooks or inventory files. Use Ansible Vault for sensitive data encryption or leverage Tower's credential management system.

Prevent over-privileged access by assigning minimal required permissions. Regular access reviews help identify and remediate excessive permissions over time.

Ensure SSH key rotation follows organizational security policies. Stale or compromised keys create security vulnerabilities in automation systems.

Core Workflow (step-by-step, decision-driven)

Workflow Objective

Execute Ansible automation tasks following a systematic approach that ensures reliability, traceability, and proper escalation when issues arise.

Prerequisites

Step 1: Pre-Execution Planning

  1. Review the automation request and identify the target playbook
  2. Verify the inventory scope matches the intended targets
  3. Check for any maintenance windows or restrictions on target systems
  4. Determine if this is a standard operation or requires escalation

Decision Point: Is this a pre-approved, standard playbook execution? If yes, continue to Step 2. If no, escalate before running anything.

Step 2: Dry Run Execution

  1. Execute the playbook in check mode first
  2. Review the planned changes output carefully
  3. Verify the scope matches expectations
  4. Document any unexpected results
ansible-playbook -i inventory playbook.yml --check --diff

Decision Point: Does the dry run output match expected changes? If yes, continue to Step 3. If no, stop and escalate to Tier 2 with the dry run output.

Step 3: Production Execution

  1. Execute the playbook against the target inventory
  2. Monitor execution progress in real-time
  3. Watch for any failed tasks or unexpected errors
  4. Document the execution start time and job ID
ansible-playbook -i inventory playbook.yml

Decision Point: Did the playbook complete successfully without failures? If yes, continue to Step 4. If no, go to Step 5.

Step 4: Success Validation

  1. Review the execution summary for all hosts
  2. Verify that all intended changes were applied
  3. Run any post-execution validation checks specified in the runbook
  4. Update the change request or ticket with success status
  5. Document completion time and any notable observations

Tier 1 Responsibility: Complete validation steps and documentation. Workflow complete.

Step 5: Failure Handling

  1. Immediately stop any ongoing execution if safe to do so
  2. Capture the complete error output and logs
  3. Identify which hosts failed and which succeeded
  4. Check if partial success requires rollback procedures

Decision Point: Is this a known, recoverable error with documented resolution? If yes, apply the documented resolution. If no, escalate to Tier 2 with the captured output.
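The dry run, confirmation, and execution steps can be chained so that production execution never starts without an explicit confirmation. This is a sketch under stated assumptions: run_with_gate and the PLAYBOOK_CMD variable are hypothetical names introduced here for illustration, and the messages are examples rather than required wording.

```shell
# run_with_gate is a hypothetical wrapper around the two commands shown in
# Steps 2 and 3. PLAYBOOK_CMD exists only so the flow can be rehearsed with a
# stand-in command; it is not an Ansible setting.
run_with_gate() (
    inventory="$1"
    playbook="$2"
    cmd="${PLAYBOOK_CMD:-ansible-playbook}"

    # Step 2: dry run. Stop here if check mode itself reports an error.
    "$cmd" -i "$inventory" "$playbook" --check --diff || {
        echo "Dry run failed: stop and escalate with the output above."
        exit 1
    }

    # Decision point: a person confirms the planned changes match the request.
    printf 'Does the dry run match the expected changes? (yes/no) '
    read -r answer
    if [ "$answer" != "yes" ]; then
        echo "Not confirmed: do not execute; escalate with the dry run output."
        exit 1
    fi

    # Step 3: production execution, recording the start time for the ticket.
    echo "Execution started: $(date)"
    if "$cmd" -i "$inventory" "$playbook"; then
        echo "Completed: continue with Step 4 (Success Validation)."
    else
        echo "Failed: continue with Step 5 (Failure Handling)."
        exit 1
    fi
)

# Example: run_with_gate inventory playbook.yml
```

The function body runs in a subshell, so an early exit ends only the wrapper and not the operator's shell session.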

Common Decision Scenarios

Scenario 1: Partial Host Failures

What would you do? 5 out of 20 target hosts failed during playbook execution.

Correct Action: Document which hosts failed and the specific errors, then escalate to Tier 2. Do not retry without understanding the failure cause.

Reasoning: Partial failures may indicate environmental issues, permission problems, or host-specific configurations that require investigation.
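Listing the failed hosts for the escalation note can be done from saved playbook output. The helper below is a hypothetical sketch: failed_hosts is not an Ansible command, and it assumes plain console output without color codes or log prefixes, in which each PLAY RECAP line starts with the host name followed by key=value counters.

```shell
# failed_hosts is a hypothetical helper. It reads ansible-playbook output on
# stdin and prints every host whose PLAY RECAP line has a failed or
# unreachable count above zero.
failed_hosts() {
    awk '
        /^PLAY RECAP/ { in_recap = 1; next }
        in_recap && NF == 0 { in_recap = 0 }
        in_recap && /failed=/ {
            host = $1
            failed = 0; unreachable = 0
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^failed=/)      { split($i, kv, "="); failed = kv[2] }
                if ($i ~ /^unreachable=/) { split($i, kv, "="); unreachable = kv[2] }
            }
            if (failed + 0 > 0 || unreachable + 0 > 0) {
                print host " failed=" failed " unreachable=" unreachable
            }
        }
    '
}

# Example (run.log is an assumed file name):
#   ansible-playbook -i inventory playbook.yml | tee run.log
#   failed_hosts < run.log
```

The output gives Tier 2 the exact host list and failure type without anyone retrying the run.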

Scenario 2: Connectivity Issues

What would you do? Playbook fails immediately with SSH connection errors to all hosts.

Correct Action: Verify network connectivity and SSH access manually to a sample host. If connectivity is confirmed down, escalate as a network issue. If access works manually, escalate as an Ansible configuration issue.

Reasoning: Distinguishing between network and configuration issues helps route the escalation appropriately.

Scenario 3: Unexpected Changes in Dry Run

What would you do? Check mode shows the playbook will modify 50 additional files not mentioned in the change request.

Correct Action: Stop the workflow and escalate to Tier 2 with the dry run output. Do not proceed with execution.

Reasoning: Scope creep in automation can have unintended consequences and requires review.

Escalation Triggers

Required Documentation

Top 10 Operational Tasks (How-To)

Task 1: Install and Configure Ansible Control Node

Applies to version(s): Ansible 2.9 through 6.x (Ansible 5.x and 6.x package ansible-core 2.12 and 2.13)

What this does: Sets up the primary Ansible control node where playbooks are executed and managed hosts are orchestrated from.

Prerequisites: Root or sudo access on a Linux system, Python 3.8 or higher installed, network connectivity to target managed hosts.

What to avoid: Do not install Ansible directly on production servers that will be managed by Ansible, as this creates circular dependency issues. Avoid using Python 2.x as it is deprecated and unsupported.

GUI method:

  1. GUI installation not available — Ansible control node installation requires command-line package management tools.

CLI method (Bash):

  1. Update package manager: sudo apt update (Ubuntu/Debian) or sudo yum update (RHEL/CentOS)
  2. Install Python pip: sudo apt install python3-pip or sudo yum install python3-pip
  3. Install Ansible via pip: pip3 install ansible
  4. Verify installation: ansible --version
  5. Create Ansible directory structure: mkdir -p ~/ansible/{playbooks,inventory,roles}
  6. Generate SSH key for host access: ssh-keygen -t rsa -b 4096 -f ~/.ssh/ansible_key

What to look for: The ansible --version command should display version information including "ansible [core 2.xx.x]" and Python version. SSH key generation should create two files: ansible_key and ansible_key.pub.

How to verify success: Run ansible localhost -m ping and receive "localhost | SUCCESS" with pong response.

If something goes wrong: If "ansible: command not found" appears, add pip's bin directory to PATH with export PATH=$PATH:~/.local/bin. If SSH key generation fails, ensure the .ssh directory exists with mkdir -p ~/.ssh && chmod 700 ~/.ssh.

Task 2: Create and Manage Inventory Files

Applies to version(s): All Ansible versions support INI format inventory; YAML format supported in 2.4+

What this does: Defines which hosts Ansible will manage and organizes them into groups for targeted automation tasks.

Prerequisites: Ansible control node installed, text editor access, knowledge of target host IP addresses or hostnames.

What to avoid: Do not include passwords in plain text inventory files. Avoid using production hostnames in test inventory files to prevent accidental execution against production systems.

GUI method:

  1. GUI inventory management not available — Inventory files must be created and edited using text editors or CLI tools.

CLI method (Bash):

  1. Create inventory directory: mkdir -p ~/ansible/inventory
  2. Create basic inventory file: nano ~/ansible/inventory/hosts
  3. Add host groups in INI format: enter group headers like [webservers] followed by host entries
  4. Add individual hosts: enter <hostname_or_ip> ansible_user=<username> under the appropriate group
  5. Save inventory file: save and close the text editor
  6. Test inventory parsing: ansible-inventory -i ~/ansible/inventory/hosts --list
  7. Verify host connectivity: ansible -i ~/ansible/inventory/hosts all -m ping

What to look for: The ansible-inventory --list command should output JSON showing your defined groups and hosts. The ping test should return "SUCCESS" and "pong" for each reachable host.

How to verify success: Run ansible -i ~/ansible/inventory/hosts <group_name> --list-hosts and confirm all expected hosts appear in the output.

If something goes wrong: If "No hosts matched" appears, check inventory file syntax for missing brackets around group names or incorrect indentation. If SSH connection fails, verify the ansible_user has SSH key access with ssh -i ~/.ssh/ansible_key <ansible_user>@<host>.
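A small INI inventory makes the group and host layout concrete. The sketch below writes one to a scratch directory so that no real inventory is overwritten; the host names, the address, and the deploy user are placeholders, not values from this environment.

```shell
# Scratch location so the demo cannot overwrite a real inventory.
demo_dir=$(mktemp -d)

# Host names, the address, and the user name below are placeholders.
cat > "$demo_dir/hosts" <<'EOF'
[webservers]
web01.example.com ansible_user=deploy
web02.example.com ansible_user=deploy

[dbservers]
192.0.2.10 ansible_user=deploy
EOF

# List the group headers that were defined.
grep '^\[' "$demo_dir/hosts"

# With Ansible installed, the same file can be parsed and queried:
#   ansible-inventory -i "$demo_dir/hosts" --list
#   ansible -i "$demo_dir/hosts" webservers --list-hosts
```

Each group header sits in square brackets on its own line, and every host line beneath it belongs to that group until the next header.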

Task 3: Write and Execute Basic Playbooks

Applies to version(s): YAML playbook format supported in all current Ansible versions

What this does: Creates automated task sequences that can be executed across multiple managed hosts for configuration management and deployment.

Prerequisites: Ansible control node configured, inventory file created, SSH access to target hosts established.

What to avoid: Do not use become: yes without specifying become_method and become_user in production environments. Avoid hardcoding sensitive values directly in playbook files.

GUI method:

  1. GUI playbook creation not available — Playbooks must be written in YAML format using text editors.

CLI method (Bash):

  1. Create playbook directory: mkdir -p ~/ansible/playbooks
  2. Create new playbook file: nano ~/ansible/playbooks/basic-setup.yml
  3. Add playbook header: enter --- on first line, then - name: <playbook_description>
  4. Define target hosts: add hosts: <group_name_or_all> with proper YAML indentation
  5. Add task section: include tasks: followed by task definitions with - name: and module specifications
  6. Save playbook file: save and close the text editor
  7. Validate playbook syntax: ansible-playbook ~/ansible/playbooks/basic-setup.yml --syntax-check
  8. Execute playbook in dry-run mode: ansible-playbook -i ~/ansible/inventory/hosts ~/ansible/playbooks/basic-setup.yml --check
  9. Execute playbook: ansible-playbook -i ~/ansible/inventory/hosts ~/ansible/playbooks/basic-setup.yml

What to look for: Syntax check should return "playbook: <filename>" with no errors. Dry-run mode shows "PLAY RECAP" with "changed=X" indicating what would be modified. Actual execution shows "ok", "changed", or "failed" status for each task.

How to verify success: Check the "PLAY RECAP" section shows zero failures and expected number of changed tasks. Run echo $? immediately after playbook execution to confirm exit code 0.

If something goes wrong: If YAML syntax errors appear, check indentation uses spaces not tabs and colons are followed by spaces. If "UNREACHABLE" status appears, verify SSH connectivity and that the ansible_user has appropriate permissions on target hosts.
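A complete minimal playbook shows how the header, hosts line, and tasks section fit together. The sketch below writes one to a scratch directory and checks it for tab characters; the play and task names are illustrative, and the single task uses the same ping module as the connectivity tests in this training.

```shell
demo_dir=$(mktemp -d)

# A minimal play with one task. The play name and host pattern are placeholders.
cat > "$demo_dir/basic-setup.yml" <<'EOF'
---
- name: Basic connectivity check
  hosts: all
  gather_facts: false
  tasks:
    - name: Confirm Ansible can reach the host
      ping:
EOF

# Tabs are a common cause of YAML errors; this prints any line containing one.
if grep -n "$(printf '\t')" "$demo_dir/basic-setup.yml"; then
    echo "Tabs found: replace them with spaces."
else
    echo "No tabs found."
fi

# With Ansible installed:
#   ansible-playbook "$demo_dir/basic-setup.yml" --syntax-check
```

Indentation is two spaces per level throughout, which is the layout the syntax check expects.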

Task 4: Configure SSH Key Authentication for Managed Hosts

Applies to version(s): All Ansible versions require SSH access to managed hosts

What this does: Establishes passwordless SSH authentication between the Ansible control node and managed hosts for secure automated access.

Prerequisites: SSH key pair generated on control node, administrative access to target hosts, SSH service running on managed hosts.

What to avoid: Do not use the same SSH key for Ansible that is used for personal administrative access. Avoid copying private keys to multiple control nodes without proper key rotation procedures.

GUI method:

  1. GUI SSH configuration not available — SSH key deployment requires command-line tools for secure key transfer.

CLI method (Bash):

  1. Copy public key to target host: ssh-copy-id -i ~/.ssh/ansible_key.pub <username>@<target_host>
  2. Test key-based authentication: ssh -i ~/.ssh/ansible_key <username>@<target_host>
  3. Exit SSH session: exit
  4. Update inventory with key path: nano ~/ansible/inventory/hosts
  5. Add SSH key parameter: append ansible_ssh_private_key_file=~/.ssh/ansible_key to host entries
  6. Test Ansible connectivity: ansible -i ~/ansible/inventory/hosts <target_host> -m ping
  7. Configure SSH agent: ssh-add ~/.ssh/ansible_key

What to look for: The ssh-copy-id command should display "Number of key(s) added: 1". SSH login should not prompt for a password. Ansible ping should return "SUCCESS" and "pong" response.

How to verify success: Run ansible -i ~/ansible/inventory/hosts all -m setup -a "filter=ansible_hostname" and receive hostname facts from all managed hosts without password prompts.

If something goes wrong: If "Permission denied (publickey)" appears, verify the public key was added to the correct user's authorized_keys file with ssh <username>@<target_host> "cat ~/.ssh/authorized_keys". If SSH agent errors occur, start the agent with eval $(ssh-agent) before adding keys.

Task 5: Use Ansible Vault for Sensitive Data

Applies to version(s): Ansible Vault available in Ansible 1.5+ with enhanced features in 2.4+

What this does: Encrypts sensitive data like passwords, API keys, and certificates within Ansible files to maintain security while enabling automation.

Prerequisites: Ansible installed, playbooks or variable files containing sensitive data, secure storage for vault passwords.

What to avoid: Do not store vault passwords in version control systems or in plain text files that other users can read. Avoid using weak passwords for vault encryption or sharing vault passwords through insecure channels.

GUI method:

  1. GUI vault management not available — Ansible Vault operations require command-line interface for security.

CLI method (Bash):

  1. Create encrypted variable file: ansible-vault create ~/ansible/group_vars/all/vault.yml (if the directory does not exist, first run mkdir -p ~/ansible/group_vars/all)
  2. Enter vault password: provide a strong password when prompted (password will not display)
  3. Add encrypted variables: enter sensitive variables in YAML format, save and exit editor
  4. Create password file: echo "<vault_password>" > ~/.ansible_vault_pass
  5. Secure password file permissions: chmod 600 ~/.ansible_vault_pass
  6. View encrypted file: ansible-vault view ~/ansible/group_vars/all/vault.yml --vault-password-file ~/.ansible_vault_pass
  7. Edit encrypted file: ansible-vault edit ~/ansible/group_vars/all/vault.yml --vault-password-file ~/.ansible_vault_pass
  8. Run playbook with vault: ansible-playbook <playbook.yml> --vault-password-file ~/.ansible_vault_pass

What to look for: Encrypted files begin with "$ANSIBLE_VAULT;1.1;AES256" followed by encrypted content. The ansible-vault view command should display decrypted YAML content. Playbook execution should access vaulted variables without errors.

How to verify success: Run cat ~/ansible/group_vars/all/vault.yml to confirm content is encrypted, then verify variables are accessible in playbooks by using debug tasks to display non-sensitive vault variables.

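If something goes wrong: If "Decryption failed" appears, verify the correct password is being used and the vault file is not corrupted. If "ERROR! Attempting to decrypt but no vault secrets found" occurs, ensure the --vault-password-file parameter is included in playbook execution commands. If vaulted variables are undefined at run time, confirm that the group_vars directory sits beside the inventory file or the playbook, because Ansible loads group_vars only from those locations.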
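Writing the password file with echo leaves the password in shell history and creates the file before its permissions are tightened. One alternative, sketched here under assumptions, is to read the password from a prompt and create the file under a restrictive umask. write_vault_pass_file is a hypothetical helper name; the default path matches the one used in the steps above.

```shell
# Hypothetical helper: writes a vault password file that is created with
# owner-only permissions. Not part of Ansible.
write_vault_pass_file() (
    pass_file="${1:-$HOME/.ansible_vault_pass}"
    umask 077                      # new files are created as 600
    printf 'Vault password: ' >&2
    # Turn off terminal echo only when input comes from a terminal.
    if [ -t 0 ]; then stty -echo; fi
    IFS= read -r vault_pass
    if [ -t 0 ]; then stty echo; printf '\n' >&2; fi
    printf '%s\n' "$vault_pass" > "$pass_file"
    chmod 600 "$pass_file"         # also covers a pre-existing, looser file
)

# Example:
#   write_vault_pass_file
#   ansible-vault view ~/ansible/group_vars/all/vault.yml --vault-password-file ~/.ansible_vault_pass
```

The function body runs in a subshell, so the umask change does not affect the rest of the session.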

Task 6: Manage Services and Packages with Ansible Modules

Applies to version(s): Service and package modules available across all current Ansible versions with OS-specific variations

What this does: Automates installation, configuration, and management of system packages and services across multiple hosts for consistent system state.

Prerequisites: Ansible control node configured, managed hosts accessible, sudo privileges configured for the ansible user on target systems.

What to avoid: ⚠️ WARNING Do not use state: absent on critical system packages without testing in non-production environments first. Avoid restarting services during business hours without proper change control approval.

GUI method:

  1. GUI service management not available — Package and service management requires playbook execution through CLI.

CLI method (Bash):

  1. Create service management playbook: nano ~/ansible/playbooks/service-management.yml
  2. Add package installation task: include task with package: module, name: <package_name>, and state: present
  3. Add service management task: include task with service: module, name: <service_name>, state: started, and enabled: yes
  4. Add become directive: include become: yes at play level for privilege escalation
  5. Validate playbook syntax: ansible-playbook ~/ansible/playbooks/service-management.yml --syntax-check
  6. Execute in check mode: ansible-playbook -i ~/ansible/inventory/hosts ~/ansible/playbooks/service-management.yml --check
  7. Execute playbook: ansible-playbook -i ~/ansible/inventory/hosts ~/ansible/playbooks/service-management.yml
  8. Verify service status: ansible -i ~/ansible/inventory/hosts all -m service -a "name=<service_name> state=started" --check --become (the service module requires state or enabled; check mode reports without changing anything)

What to look for: Package installation shows "changed" status when installing new packages or "ok" when already present. Service tasks display "changed" when starting stopped services or "ok" when already running. Service verification shows "state: started" in the output.

How to verify success: Run ansible -i ~/ansible/inventory/hosts all -m shell -a "systemctl is-active <service_name>" --become and confirm "active" status returned from all hosts.

If something goes wrong: If a "Missing sudo password" error appears, use the --ask-become-pass flag or supply ansible_become_pass from a vaulted variable rather than plain text in inventory. If package installation fails with a "No package matching" error, verify the package name is correct for the target OS distribution and that the appropriate package module (apt, yum, dnf) is used.
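Put together, the package task, service task, and become directive from the steps above form a short playbook. The sketch below writes it to a scratch directory; nginx and the webservers group are placeholders chosen for illustration, not requirements of this procedure.

```shell
demo_dir=$(mktemp -d)

# nginx is only a placeholder; substitute the package and service you manage.
cat > "$demo_dir/service-management.yml" <<'EOF'
---
- name: Install a package and keep its service running
  hosts: webservers
  become: yes
  tasks:
    - name: Ensure the package is installed
      package:
        name: nginx
        state: present

    - name: Ensure the service is started and enabled at boot
      service:
        name: nginx
        state: started
        enabled: yes
EOF

# Show the task names as a quick review of what the play will do.
grep -- '- name:' "$demo_dir/service-management.yml"
```

Because both tasks describe a desired state rather than an action, running the play a second time against an already configured host should report "ok" instead of "changed".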

Task 7: Collect System Facts and Generate Reports

Applies to version(s): Setup module available in all Ansible versions with expanded fact collection in 2.0+

What this does: Gathers comprehensive system information from managed hosts for inventory management, compliance reporting, and troubleshooting purposes.

Prerequisites: Ansible control node configured, SSH access to managed hosts, sufficient disk space for fact output storage.

What to avoid: Do not collect facts from large numbers of hosts simultaneously without rate limiting, as this can overwhelm network resources. Avoid storing fact output in version control due to sensitive system information.

GUI method:

  1. GUI fact collection not available — System fact gathering requires CLI execution and can output to various formats for reporting tools.

CLI method (Bash):

  1. Collect all facts from hosts: ansible -i ~/ansible/inventory/hosts all -m setup
  2. Filter specific fact categories: ansible -i ~/ansible/inventory/hosts all -m setup -a "filter=ansible_distribution*"
  3. Save facts to JSON files, one per host: ansible -i ~/ansible/inventory/hosts all -m setup --tree ~/ansible/facts/
  4. Create fact reporting playbook: nano ~/ansible/playbooks/fact-report.yml
  5. Add fact gathering task: include gather_facts: yes and debug tasks to display specific facts
  6. Generate custom fact report: ansible-playbook -i ~/ansible/inventory/hosts ~/ansible/playbooks/fact-report.yml
  7. Save selected facts to a text report (the output is plain text, not CSV): ansible -i ~/ansible/inventory/hosts all -m setup -a "filter=ansible_hostname,ansible_distribution,ansible_memtotal_mb" | grep -E "(ansible_hostname|ansible_distribution|ansible_memtotal_mb)" > ~/ansible/system-report.txt

What to look for: Fact collection returns JSON-formatted data with "ansible_facts" containing system information. The --tree option creates individual JSON files named by hostname. Filtered facts show only requested information categories.

How to verify success: Check that fact files exist in the specified directory with ls -la ~/ansible/facts/ and verify JSON content is valid with python3 -m json.tool ~/ansible/facts/<hostname>.

If something goes wrong: If "Permission denied" errors occur during fact collection, verify the ansible user has read access to system files like /proc/meminfo and /etc/os-release. If fact gathering times out, increase the timeout value with -T 30 parameter or reduce the number of target hosts per execution.
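When a true CSV report is needed, the per-host JSON files written by --tree can be converted with a short script. This is a sketch under assumptions: facts_to_csv is a hypothetical helper, it relies on python3 being available on the control node, and it assumes each file holds one host's setup output with the facts stored under the "ansible_facts" key.

```shell
# facts_to_csv is a hypothetical helper. It prints one CSV row per fact file
# found in the given directory. Files without an "ansible_facts" key (for
# example from unreachable hosts) produce a row of empty values.
facts_to_csv() {
    facts_dir="$1"
    echo "hostname,distribution,memtotal_mb"
    for fact_file in "$facts_dir"/*; do
        [ -f "$fact_file" ] || continue
        python3 - "$fact_file" <<'EOF'
import csv
import json
import sys

with open(sys.argv[1]) as handle:
    facts = json.load(handle).get("ansible_facts", {})

csv.writer(sys.stdout, lineterminator="\n").writerow([
    facts.get("ansible_hostname", ""),
    facts.get("ansible_distribution", ""),
    facts.get("ansible_memtotal_mb", ""),
])
EOF
    done
}

# Example (output file name is an assumption):
#   facts_to_csv ~/ansible/facts > ~/ansible/system-report.csv
```

Using a JSON parser instead of grep keeps each host's values on a single row and quotes any value that contains a comma.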

Task 8: Deploy Configuration Files with Templates

Applies to version(s): Jinja2 templating available in all current Ansible versions

What this does: Creates dynamic configuration files using templates that incorporate host-specific variables and facts for consistent yet customized deployments.

Prerequisites: Ansible control node configured, template files created, target directories writable by ansible user, backup strategy for existing configuration files.

What to avoid: ⚠️ WARNING Do not deploy templates to production configuration files without testing and backup procedures. Avoid using undefined variables in templates as this will cause deployment failures.

GUI method:

  1. GUI template deployment not available — Template processing requires CLI playbook execution with Jinja2 rendering.

CLI method (Bash):

  1. Create templates directory: mkdir -p ~/ansible/templates
  2. Create Jinja2 template file: nano ~/ansible/templates/config.conf.j2
  3. Add template variables: include Jinja2 syntax like {{ ansible_hostname }} and {{ custom_variable }}
  4. Define template variables: nano ~/ansible/group_vars/all/main.yml and add variable definitions
  5. Create template deployment playbook: nano ~/ansible/playbooks/deploy-config.yml
  6. Add template task: include template: module with src: config.conf.j2, dest: /path/to/config.conf, and backup: yes
  7. Test template rendering: ansible -i ~/ansible/inventory/hosts <host> -m template -a "src=~/ansible/templates/config.conf.j2 dest=/tmp/test-config.conf" --check
  8. Deploy template: ansible-playbook -i ~/ansible/inventory/hosts ~/ansible/playbooks/deploy-config.yml

What to look for: Template task shows "changed" status when deploying new or modified templates. The backup parameter keeps a timestamped copy of the previous file next to the destination. Check mode displays the rendered template differences when --diff is also supplied.

How to verify success: Run ansible -i ~/ansible/inventory/hosts all -m shell -a "cat /path/to/config.conf" to verify template variables were properly substituted with host-specific values.

If something goes wrong: If "AnsibleUndefinedVariable" errors appear, check that all template variables are defined in group_vars, host_vars, or playbook vars sections. If template deployment fails with permission errors, verify the destination directory exists and the ansible user has write permissions with appropriate become privileges.
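A template, its variable definitions, and the deploying playbook are easier to follow side by side. The sketch below creates all three in a scratch directory and lists the variables the template expects. The app_port variable and the file contents are made up for illustration, and the layout assumes templates/ and group_vars/ sit next to the playbook, which is one of the locations Ansible searches.

```shell
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/templates" "$demo_dir/group_vars/all"

# Template: ansible_hostname is a gathered fact; app_port is a made-up variable.
cat > "$demo_dir/templates/config.conf.j2" <<'EOF'
# Managed by Ansible on {{ ansible_hostname }}
listen_port = {{ app_port }}
EOF

# group_vars placed next to the playbook so Ansible loads it automatically.
cat > "$demo_dir/group_vars/all/main.yml" <<'EOF'
---
app_port: 8080
EOF

cat > "$demo_dir/deploy-config.yml" <<'EOF'
---
- name: Deploy configuration from a template
  hosts: all
  tasks:
    - name: Render config.conf
      template:
        src: config.conf.j2
        dest: /tmp/test-config.conf
        backup: yes
EOF

# List the variables the template expects, so each can be checked for a definition.
grep -o '{{ *[a-z_]* *}}' "$demo_dir/templates/config.conf.j2" | tr -d '{} ' | sort -u
```

Any name in that list that is neither a gathered fact nor defined in a vars file is a likely source of an undefined variable error.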

Task 9: Execute Ad-Hoc Commands for Troubleshooting

Applies to version(s): Ad-hoc command functionality available in all Ansible versions

What this does: Runs immediate commands across multiple hosts for quick troubleshooting, system checks, and emergency response without creating formal playbooks.

Prerequisites: Ansible control node configured, SSH access to target hosts, appropriate privileges for commands being executed.

What to avoid: ⚠️ WARNING Do not execute destructive commands like rm, mkfs, or service stops without explicit approval. Avoid running commands that require interactive input as they will hang indefinitely.

GUI method:

  1. GUI ad-hoc execution not available — Ad-hoc commands require direct CLI execution for immediate response capabilities.

CLI method (Bash):

  1. Check system uptime: ansible -i ~/ansible/inventory/hosts all -m shell -a "uptime"
  2. Verify disk space: ansible -i ~/ansible/inventory/hosts all -m shell -a "df -h"
  3. Check service status: ansible -i ~/ansible/inventory/hosts all -m service -a "name=<service_name> state=started" --check --become (the service module requires state or enabled; check mode reports without changing anything)
  4. Gather specific system info: ansible -i ~/ansible/inventory/hosts all -m setup -a "filter=ansible_memory_mb"
  5. Copy files to hosts: ansible -i ~/ansible/inventory/hosts all -m copy -a "src=/local/file dest=/remote/path"
  6. Execute with privilege escalation: ansible -i ~/ansible/inventory/hosts all -m shell -a "systemctl status <service>" --become
  7. Run on specific host group: ansible -i ~/ansible/inventory/hosts <group_name> -m ping
  8. Set connection timeout: ansible -i ~/ansible/inventory/hosts all -m shell -a "long-running-command" -T 60

What to look for: Successful commands return "SUCCESS" status with command output. Failed commands show "FAILED" status with error messages. Unreachable hosts display "UNREACHABLE" with connection details.

How to verify success: Check that all expected hosts respond with "SUCCESS" status and review command output for expected results. Use echo $? to verify the ansible command itself completed with exit code 0.

If something goes wrong: If connections time out, increase the connection timeout with -T <seconds>; if the command itself runs too long, break it into smaller operations. If "MODULE FAILURE" appears, verify the module name is correct and the target hosts have required dependencies installed (such as Python for the shell module).

Task 10: Monitor and Parse Ansible Logs

Applies to version(s): Logging functionality available in all Ansible versions with enhanced options in 2.0+

What this does: Configures comprehensive logging and monitors Ansible execution for troubleshooting, compliance auditing, and performance analysis.

Prerequisites: Ansible control node configured, write permissions to log directories, log rotation tools available for long-term log management.

What to avoid: Do not log to directories without sufficient disk space as this can fill filesystems. Avoid logging sensitive data like passwords or API keys in verbose mode output.

GUI method:

  1. GUI log monitoring not available — Ansible logging requires CLI configuration and file-based log analysis tools.

CLI method (Bash):

  1. Create log directory: mkdir -p ~/ansible/logs
  2. Configure Ansible logging: export ANSIBLE_LOG_PATH=~/ansible/logs/ansible.log
  3. Enable verbose logging: export ANSIBLE_DEBUG=True
  4. Make logging persistent: echo "export ANSIBLE_LOG_PATH=~/ansible/logs/ansible.log" >> ~/.bashrc
  5. Execute playbook with logging: ansible-playbook -i ~/ansible/inventory/hosts ~/ansible/playbooks/<playbook.yml> -v
  6. Monitor real-time logs: tail -f ~/ansible/logs/ansible.log
  7. Search for specific events: grep -i "failed\|error" ~/ansible/logs/ansible.log
  8. Parse execution times: grep "PLAY RECAP" ~/ansible/logs/ansible.log
  9. Rotate log files: logrotate -f ~/ansible/logrotate.conf (after creating appropriate logrotate configuration)

What to look for: Log entries include timestamps, log levels (DEBUG, INFO, WARNING, ERROR), and detailed execution information. Failed tasks appear with "FAILED" status and error details. Successful completions show "PLAY RECAP" with execution statistics.

How to verify success: Confirm log file exists and contains recent entries with ls -la ~/ansible/logs/ and tail ~/ansible/logs/ansible.log. Verify log rotation prevents excessive disk usage.

If something goes wrong: If no logs appear, verify the ANSIBLE_LOG_PATH directory exists and is writable with touch ~/ansible/logs/test.log. If logs contain permission errors, check that the ansible user has appropriate access to create files in the specified log directory and consider using sudo for log directory creation.
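The log rotation step depends on a logrotate configuration file that this task does not show. The sketch below writes a possible one to a scratch directory. The retention values (weekly, eight files) are examples rather than a recommendation, and the paths are placeholders for the real log location.

```shell
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/logs"

# logrotate does not expand "~", so the heredoc is left unquoted to let the
# shell write the absolute path into the file.
cat > "$demo_dir/logrotate.conf" <<EOF
$demo_dir/logs/ansible.log {
    weekly
    rotate 8
    compress
    missingok
    notifempty
}
EOF

cat "$demo_dir/logrotate.conf"

# With logrotate installed, a non-root user also needs a writable state file:
#   logrotate -s "$demo_dir/logrotate.state" -f "$demo_dir/logrotate.conf"
```

The missingok and notifempty directives keep rotation from failing or producing empty archives when no playbooks have run since the last rotation.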

Top 10 Administrative Tasks (How-To)

Task 1: Install and Configure Ansible Control Node

Applies to version(s): Ansible 2.9 through 6.x (Ansible 5.x and 6.x package ansible-core 2.12 and 2.13)

What this does: Sets up the primary Ansible control node from which all automation tasks will be executed and managed.

Prerequisites: Linux system with Python 3.8+ installed, sudo access, network connectivity to target hosts.

What to avoid: Do not install Ansible on Windows as a control node - it is not supported. Do not use Python 2.7 as it is deprecated and will cause compatibility issues.

GUI method:

  1. No GUI method available — Ansible control node installation requires command-line interface.

CLI method (Bash):

  1. Update package manager — Run sudo apt update && sudo apt upgrade -y on Ubuntu/Debian or sudo yum update -y on RHEL/CentOS
  2. Install Python pip — Run sudo apt install python3-pip -y or sudo yum install python3-pip -y
  3. Install Ansible via pip — Run pip3 install ansible
  4. Verify installation — Run ansible --version
  5. Create Ansible directory structure — Run mkdir -p ~/ansible/{playbooks,inventory,roles}

What to look for: The ansible --version command should display version information including ansible-core version, config file location, and Python version. Directory creation should complete without errors.

How to verify success: Run ansible localhost -m ping and receive a successful pong response with "changed": false status.

If something goes wrong: If pip installation fails, install using package manager with sudo apt install ansible. If Python version conflicts occur, use python3 -m pip install --user ansible to install in user space.

Task 2: Create and Manage Inventory Files

Applies to version(s): All Ansible versions support INI format inventory; YAML format supported in 2.4+

What this does: Defines target hosts and groups that Ansible will manage, enabling organized automation across infrastructure.

Prerequisites: Ansible control node installed, text editor access, knowledge of target host IP addresses or hostnames.

What to avoid: Do not include passwords in plain text inventory files. Do not use spaces in group names as this causes parsing errors.

GUI method:

  1. No native GUI method — Use any text editor to create inventory files manually.

CLI method (Bash):

  1. Create inventory file — Run nano ~/ansible/inventory/hosts
  2. Add host groups — Enter INI format: [webservers] followed by host entries
  3. Define individual hosts — Add lines like <hostname_or_ip> ansible_user=<username>
  4. Add group variables — Create section [webservers:vars] and add common variables
  5. Test inventory parsing — Run ansible-inventory -i ~/ansible/inventory/hosts --list

What to look for: The ansible-inventory --list command should output JSON format showing all hosts organized by groups with no parsing errors.

How to verify success: Run ansible all -i ~/ansible/inventory/hosts --list-hosts to see all managed hosts listed correctly.

If something goes wrong: If parsing fails, check for missing brackets around group names or invalid INI syntax. If hosts are unreachable, verify SSH connectivity with ssh <username>@<hostname> manually.

Task 3: Configure SSH Key Authentication

Applies to version(s): All Ansible versions - SSH is the default connection method

What this does: Establishes passwordless SSH authentication from control node to managed hosts for secure automated connections.

Prerequisites: SSH client installed on control node, user accounts on target hosts, network connectivity on port 22.

What to avoid: Do not disable SSH host key checking globally in production - this creates security vulnerabilities. Do not use weak SSH key algorithms like DSA.

GUI method:

  1. No GUI method available — SSH key generation and distribution requires command-line tools.

CLI method (Bash):

  1. Generate SSH key pair — Run ssh-keygen -t rsa -b 4096 -C "ansible-control-node"
  2. Accept default location — Press Enter when prompted for file location to use ~/.ssh/id_rsa
  3. Set empty passphrase — Press Enter twice for empty passphrase (required for automation)
  4. Copy public key to target host — Run ssh-copy-id <username>@<target_host>
  5. Test passwordless connection — Run ssh <username>@<target_host>

What to look for: SSH key generation should display key fingerprint and randomart image. ssh-copy-id should show "Number of key(s) added: 1" message.

How to verify success: SSH connection should complete without password prompt, and ansible -i ~/ansible/inventory/hosts <target_host> -m ping should return successful pong response.

If something goes wrong: If ssh-copy-id fails, manually append public key content to target host's ~/.ssh/authorized_keys file. If connection is refused, verify SSH service is running with sudo systemctl status ssh (sshd on RHEL/CentOS) on target host.

Task 4: Write and Execute Basic Playbooks

Applies to version(s): YAML playbook format supported in all modern Ansible versions (2.0+)

What this does: Creates reusable automation scripts that define desired system state and execute tasks across managed infrastructure.

Prerequisites: Ansible installed, inventory configured, SSH authentication working, basic YAML syntax knowledge.

What to avoid: Do not use tabs for indentation in YAML files - use spaces only. Do not run playbooks with --check mode in production without understanding module limitations.

GUI method:

  1. No native GUI method — Use text editor to create YAML playbook files manually.

CLI method (Bash):

  1. Create playbook file — Run nano ~/ansible/playbooks/basic-setup.yml
  2. Add playbook header — Enter --- on first line, then - name: Basic System Setup
  3. Define target hosts — Add hosts: all and become: yes for sudo privileges
  4. Add tasks section — Enter tasks: followed by indented task definitions
  5. Execute playbook — Run ansible-playbook -i ~/ansible/inventory/hosts ~/ansible/playbooks/basic-setup.yml

What to look for: Playbook execution should show "PLAY RECAP" summary with ok/changed/unreachable/failed counts for each host. Tasks should display "ok" or "changed" status.

How to verify success: All hosts in PLAY RECAP should show 0 unreachable and 0 failed tasks. Run playbook again to verify idempotency with "changed=0" results.

If something goes wrong: If YAML syntax errors occur, use ansible-playbook --syntax-check <playbook.yml> to validate. If tasks fail, add -vvv flag for verbose debugging output.
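The idempotency check can be reduced to a single number by summing the changed counts from the second run. The helper below is a hypothetical sketch, not an Ansible feature: changed_total assumes plain console output in which PLAY RECAP lines carry changed=N counters.

```shell
# changed_total is a hypothetical helper. It sums the changed= counts from
# the PLAY RECAP lines of ansible-playbook output read on stdin.
changed_total() {
    awk '
        /^PLAY RECAP/ { in_recap = 1; next }
        in_recap {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^changed=/) { split($i, kv, "="); total += kv[2] }
            }
        }
        END { print total + 0 }
    '
}

# Example: run the playbook a second time and sum the changes it reports.
#   ansible-playbook -i ~/ansible/inventory/hosts ~/ansible/playbooks/basic-setup.yml | changed_total
```

A result of 0 on the second run indicates the playbook is idempotent; any other value points to a task that changes state on every run.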

Task 5: Use Ansible Vault for Sensitive Data

Applies to version(s): Ansible Vault available in Ansible 1.5+ with AES256 encryption

What this does: Encrypts sensitive data like passwords and API keys within Ansible files to maintain security while enabling automation.

Prerequisites: Ansible installed, playbooks or variable files containing sensitive data, secure password management.

What to avoid: Do not store vault passwords in plain text files or version control. Do not use weak passwords for vault encryption.

GUI method:

  1. No GUI method available — Ansible Vault operations require command-line interface.

CLI method (Bash):

  1. Create encrypted file — Run ansible-vault create ~/ansible/vault/secrets.yml
  2. Enter vault password — Provide strong password when prompted (password will be required for decryption)
  3. Add encrypted variables — Enter YAML format variables in opened editor, save and exit
  4. Use in playbook — Reference vault file with vars_files: ~/ansible/vault/secrets.yml
  5. Run with vault password — Execute ansible-playbook --ask-vault-pass <playbook.yml>
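
As a sketch of how these pieces fit together, the vault file holds ordinary YAML variables and the playbook loads them through vars_files. The variable name db_password and the destination path are hypothetical; no_log is used here to keep the value out of task output.

```yaml
---
# Contents typed into the editor opened by "ansible-vault create" (stored encrypted):
#   db_password: "example-only-value"

- name: Use vaulted variables
  hosts: all
  become: yes
  vars_files:
    - ~/ansible/vault/secrets.yml
  tasks:
    - name: Write database credentials file (hypothetical path)
      copy:
        content: "password={{ db_password }}\n"
        dest: /etc/example-app.conf
        owner: root
        group: root
        mode: '0600'
      no_log: true   # suppress the secret in logs and console output
```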

What to look for: Encrypted vault files should contain $ANSIBLE_VAULT;1.1;AES256 header followed by encrypted content. Playbook execution should prompt for vault password.

How to verify success: Run ansible-vault view ~/ansible/vault/secrets.yml and successfully decrypt content with correct password. Playbook should execute without exposing sensitive values in output.

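If something goes wrong: If the vault password is lost, the encrypted content cannot be recovered, so keep a secure backup of the password. If decryption fails, first confirm you are entering the correct password, then check that the file still begins with the $ANSIBLE_VAULT header and has not been truncated or altered.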

Task 6: Install and Manage Software Packages

Applies to version(s): Package modules available across all Ansible versions with distribution-specific modules

What this does: Automates software installation, updates, and removal across multiple systems using appropriate package managers.

Prerequisites: Target hosts accessible, appropriate package manager available (apt, yum, dnf), sudo privileges configured.

What to avoid: ⚠️ WARNING Do not use state: latest in production without change control approval as this can cause unexpected updates. Do not mix package managers on the same system.

GUI method:

  1. No GUI method available — Package management requires playbook execution via command line.

CLI method (Bash):

  1. Create package playbook — Run nano ~/ansible/playbooks/package-management.yml
  2. Define package task for Ubuntu/Debian — Add task using apt module with name: <package_name> and state: present
  3. Define package task for RHEL/CentOS — Add task using yum or dnf module with same parameters
  4. Add update cache option — Include update_cache: yes for apt or update_cache: true for yum
  5. Execute package playbook — Run ansible-playbook -i <inventory> ~/ansible/playbooks/package-management.yml
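
One way the playbook described in these steps could be written is shown below. It is a sketch only: the package name git is an example, and the ansible_os_family conditions assume fact gathering is left enabled.

```yaml
---
# ~/ansible/playbooks/package-management.yml
- name: Package management
  hosts: all
  become: yes
  tasks:
    - name: Install package on Ubuntu/Debian hosts
      apt:
        name: git            # example package name
        state: present
        update_cache: yes
      when: ansible_os_family == "Debian"

    - name: Install package on RHEL-family hosts
      dnf:
        name: git            # example package name
        state: present
      when: ansible_os_family == "RedHat"
```

Using state: present rather than state: latest keeps repeated runs from upgrading the package unexpectedly.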

What to look for: Tasks should show "changed" status when packages are installed or "ok" when already present. Package cache updates should complete successfully.

How to verify success: Run ansible all -m shell -a "which <package_command>" to verify package installation, or check with distribution-specific commands like dpkg -l <package>.

If something goes wrong: If package not found errors occur, verify package names are correct for target distribution. If permission denied, ensure become: yes is set in playbook and sudo access is configured.

Task 7: Configure and Manage Services

Applies to version(s): Service module available in all Ansible versions with systemd support in 2.2+

What this does: Automates starting, stopping, enabling, and disabling system services across managed infrastructure.

Prerequisites: Target systems with systemd or init system, sudo privileges, services installed on target hosts.

What to avoid: ⚠️ WARNING Do not stop critical services like SSH or networking without console access to target systems. Do not use state: restarted on production services without change approval.

GUI method:

  1. No GUI method available — Service management requires playbook execution via command line.

CLI method (Bash):

  1. Create service management playbook — Run nano ~/ansible/playbooks/service-management.yml
  2. Add service task — Use service module with name: <service_name> parameter
  3. Set service state — Add state: started, stopped, or restarted as required
  4. Configure service enablement — Add enabled: yes to start service at boot or enabled: no to disable
  5. Execute service playbook — Run ansible-playbook -i <inventory> ~/ansible/playbooks/service-management.yml
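
A minimal sketch of the resulting playbook follows. The service name nginx is an example only; substitute the service you manage.

```yaml
---
# ~/ansible/playbooks/service-management.yml
- name: Service management
  hosts: all
  become: yes
  tasks:
    - name: Ensure the service is running and enabled at boot
      service:
        name: nginx        # example service name
        state: started     # or stopped / restarted, as required
        enabled: yes       # start the service at boot
```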

What to look for: Service tasks should show "changed" when service state is modified or "ok" when already in desired state. No error messages about service not found.

How to verify success: Run ansible all -m shell -a "systemctl status <service_name>" to verify service status matches desired configuration.

If something goes wrong: If service not found errors occur, verify service name spelling and that service is installed. If permission errors occur, ensure become: yes is configured and user has sudo access.

Task 8: Collect System Information and Facts

Applies to version(s): Setup module and fact gathering available in all Ansible versions

What this does: Gathers detailed system information from managed hosts for inventory, compliance reporting, and conditional task execution.

Prerequisites: Ansible control node configured, target hosts accessible via SSH, basic inventory file created.

What to avoid: Do not disable fact gathering with gather_facts: no unless it is specifically needed for performance; conditionals, templates, and tasks that reference facts such as ansible_os_family will fail when facts have not been collected.

GUI method:

  1. No GUI method available — Fact collection requires command-line execution or playbook tasks.

CLI method (Bash):

  1. Collect all facts from host — Run ansible <hostname> -m setup
  2. Filter specific fact categories — Run ansible <hostname> -m setup -a "filter=ansible_os_family"
  3. Gather facts in playbook — Add gather_facts: yes to playbook header (enabled by default)
  4. Save facts to file — Run ansible <hostname> -m setup --tree ~/ansible/facts/
  5. Use facts in tasks — Reference facts with {{ ansible_hostname }} or {{ ansible_distribution }}
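
The last step can be sketched as a short playbook that prints a few facts and uses one in a condition. The task names and messages are illustrative.

```yaml
---
- name: Show and use gathered facts
  hosts: all
  gather_facts: yes      # the default, shown here for clarity
  tasks:
    - name: Print hostname and distribution
      debug:
        msg: "{{ ansible_hostname }} runs {{ ansible_distribution }} {{ ansible_distribution_version }}"

    - name: Run only on Debian-family hosts
      debug:
        msg: "This host uses apt"
      when: ansible_os_family == "Debian"
```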

What to look for: Setup module should return JSON output containing system information like OS version, IP addresses, memory, and disk space. No connection or permission errors.

How to verify success: Verify specific facts are collected correctly by running ansible <hostname> -m setup -a "filter=ansible_hostname" and confirming output matches expected system hostname.

If something goes wrong: If fact gathering fails, check SSH connectivity and Python installation on target host. If specific facts are missing, verify the target system supports that information type.

Task 9: Handle Files and Templates

Applies to version(s): Copy, template, and file modules available in all Ansible versions with Jinja2 templating

What this does: Manages configuration files, copies static files, and generates dynamic content using templates across managed systems.

Prerequisites: Source files or templates available on control node, target directory permissions configured, backup strategy for modified files.

What to avoid: ⚠️ WARNING Do not overwrite critical system files without backup enabled using backup: yes. Do not use templates for binary files - use copy module instead.

GUI method:

  1. No GUI method available — File operations require playbook tasks executed via command line.

CLI method (Bash):

  1. Create file management playbook — Run nano ~/ansible/playbooks/file-management.yml
  2. Copy static file — Add task using copy module with src: <local_file> and dest: <remote_path>
  3. Set file permissions — Add mode: '0644', owner: <username>, and group: <groupname>
  4. Use template for dynamic content — Add task using template module with src: <template.j2> and dest: <remote_path>
  5. Enable backup — Add backup: yes to preserve original files
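
Put together, the steps above might produce a playbook like this sketch. The source file, template, destination path, and the app_port variable are all hypothetical.

```yaml
---
# ~/ansible/playbooks/file-management.yml
- name: File management
  hosts: all
  become: yes
  vars:
    app_port: 8080                   # hypothetical variable used by the template
  tasks:
    - name: Copy static file
      copy:
        src: files/motd              # hypothetical file on the control node
        dest: /etc/motd
        owner: root
        group: root
        mode: '0644'
        backup: yes                  # keep a timestamped copy of the original

    - name: Render configuration from template
      template:
        src: templates/app.conf.j2   # hypothetical Jinja2 template
        dest: /etc/example-app.conf
        owner: root
        group: root
        mode: '0644'
        backup: yes
```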

What to look for: File tasks should show "changed" when files are modified or "ok" when already correct. Template tasks should process Jinja2 variables successfully.

How to verify success: Run ansible all -m shell -a "ls -la <target_file>" to verify file exists with correct permissions, or use stat module to check file properties.

If something goes wrong: If permission denied errors occur, verify target directory exists and user has write access. If template errors occur, check Jinja2 syntax and variable definitions in playbook.

Task 10: Monitor and Troubleshoot Playbook Execution

Applies to version(s): Logging and debugging features available across all Ansible versions with enhancements in 2.5+

What this does: Provides visibility into playbook execution, identifies failures, and collects diagnostic information for troubleshooting automation issues.

Prerequisites: Ansible playbooks created, log file permissions configured, understanding of Ansible output formats.

What to avoid: Do not use maximum verbosity (-vvvv) in production as it may expose sensitive information in logs. Do not ignore unreachable hosts without investigating connectivity issues.

GUI method:

  1. No native GUI method — Use text editor or log viewing tools to examine Ansible log files and output.

CLI method (Bash):

  1. Run playbook with verbose output — Execute ansible-playbook -vvv <playbook.yml>
  2. Check syntax before execution — Run ansible-playbook --syntax-check <playbook.yml>
  3. Perform dry run — Execute ansible-playbook --check <playbook.yml>
  4. Enable logging to file — Set export ANSIBLE_LOG_PATH=~/ansible/logs/ansible.log
  5. Review specific host results — Run ansible-playbook --limit <hostname> <playbook.yml>
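
Alongside the command-line options above, a playbook can expose diagnostic detail itself. The sketch below registers a task result and prints it only when the run uses -v or higher; the task and variable names are illustrative.

```yaml
---
- name: Inspect task results while troubleshooting
  hosts: all
  gather_facts: no
  tasks:
    - name: Capture kernel release
      command: uname -r
      register: kernel_result
      changed_when: false    # read-only command, never report "changed"

    - name: Show the registered result only with -v or higher
      debug:
        var: kernel_result
        verbosity: 1
```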

What to look for: Verbose output should show SSH connections, module execution details, and variable values. A successful syntax check prints "playbook: <filename>" and exits with status 0; a failed check reports the error along with the file, line, and column where it was found.

How to verify success: PLAY RECAP should show all hosts with 0 unreachable and 0 failed tasks. Log files should contain detailed execution information without error messages.

If something goes wrong: If tasks fail intermittently, check network connectivity and SSH key authentication. If modules report errors, verify target system has required dependencies and permissions for the specific module operations.

Playbooks / Scenarios / Workflows

Understanding Ansible Playbooks

Playbooks are YAML files that define a series of tasks to be executed on target hosts. They represent the core automation workflows in Ansible, combining tasks, variables, handlers, and roles into executable automation scenarios.

Basic playbook structure includes:
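
As a rough sketch, the elements named above (a play with target hosts, variables, tasks, and handlers) fit together as follows. The variable, template, and file names are hypothetical.

```yaml
---
- name: Descriptive play name
  hosts: webservers          # inventory group to target
  become: yes                # privilege escalation for the whole play
  vars:
    http_port: 80            # hypothetical variable
  tasks:
    - name: Deploy site configuration
      template:
        src: site.conf.j2    # hypothetical template
        dest: /etc/httpd/conf.d/site.conf
      notify: Restart Apache # queue the handler only when this task changes something
  handlers:
    - name: Restart Apache
      service:
        name: httpd
        state: restarted
```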

Common Automation Scenarios

System Configuration Management

Playbooks for standardizing system configurations across multiple servers:

---
- name: Configure web servers
  hosts: webservers
  become: yes
  tasks:
    - name: Install Apache
      package:
        name: httpd
        state: present
    
    - name: Start Apache service
      service:
        name: httpd
        state: started
        enabled: yes

Application Deployment

Automated deployment workflows that handle code updates, service restarts, and validation:

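---
- name: Deploy application
  hosts: app_servers
  become: yes
  vars:
    app_version: "{{ version | default('latest') }}"
  tasks:
    - name: Stop application service
      service:
        name: myapp
        state: stopped

    - name: Deploy new version
      copy:
        src: "/builds/myapp-{{ app_version }}.jar"
        dest: "/opt/myapp/myapp.jar"

    # Started explicitly rather than through a handler: a handler fires only
    # when the copy reports a change, which would leave the service stopped
    # on a run where the file is already up to date.
    - name: Start application service
      service:
        name: myapp
        state: started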

Security Hardening

Playbooks that implement security policies and compliance requirements:

---
- name: Security hardening
  hosts: all
  become: yes
  tasks:
    - name: Update all packages
      package:
        name: "*"
        state: latest
    
    - name: Configure firewall rules
      firewalld:
        service: ssh
        permanent: yes
        state: enabled
        immediate: yes

Workflow Design Patterns

Multi-Stage Deployments

Orchestrating complex deployments across multiple environments:

Conditional Execution

Using when conditions and blocks for environment-specific tasks:

- name: Configure development settings
  template:
    src: dev-config.j2
    dest: /etc/myapp/config.yml
  when: deploy_env == "development"  # deploy_env is an illustrative variable name; "environment" is a reserved keyword in Ansible

Error Handling and Recovery

Implementing robust error handling in automation workflows:

- block:
    - name: Risky operation
      command: /usr/local/bin/risky-command
  rescue:
    - name: Handle failure
      debug:
        msg: "Operation failed, initiating recovery"
    - name: Recovery action
      service:
        name: backup-service
        state: started

Playbook Execution Workflows

Standard Execution Process

  1. Validate playbook syntax using ansible-playbook --syntax-check
  2. Run in check mode first: ansible-playbook --check playbook.yml
  3. Execute with appropriate verbosity: ansible-playbook -v playbook.yml
  4. Monitor execution progress and task results
  5. Verify expected outcomes on target systems

Execution Options and Controls

Key execution parameters for different scenarios:

Scenario-Based Examples

Scenario: Emergency Security Patch

Situation: Critical security vulnerability requires immediate patching across all systems.

Workflow:

  1. Create targeted playbook for specific package update
  2. Test on development systems first
  3. Execute with rolling updates to minimize downtime
  4. Validate patch installation and system functionality

What would you do if 10% of systems fail the patch installation?

Answer: Immediately stop the rolling update, isolate failed systems, analyze failure logs, and determine if rollback is necessary while investigating the root cause.
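
A rolling patch of this kind could be sketched as below. The package name is hypothetical, and the batch size and failure threshold are assumptions to adjust for your environment.

```yaml
---
- name: Emergency security patch (rolling)
  hosts: all
  become: yes
  serial: "10%"              # patch one tenth of the hosts per batch
  max_fail_percentage: 9     # abort the play once more than 9% of a batch fails
  tasks:
    - name: Update the vulnerable package
      package:
        name: openssl        # hypothetical package
        state: latest
```

The play aborts only when the failure percentage exceeds max_fail_percentage, so a value just below the tolerated rate stops the rollout at that rate.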

Scenario: Database Maintenance Window

Situation: Scheduled maintenance requires coordinated shutdown of application tiers and database operations.

Workflow:

  1. Stop application services in reverse dependency order
  2. Perform database maintenance tasks
  3. Restart services in proper dependency order
  4. Validate application functionality
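
The ordering in this workflow can be expressed as consecutive plays, which Ansible runs top to bottom. This is a sketch only: the maintenance script and the port number are hypothetical, while the group and service names follow the earlier examples.

```yaml
---
- name: Stop application tier
  hosts: app_servers
  become: yes
  tasks:
    - name: Stop application service
      service:
        name: myapp
        state: stopped

- name: Run database maintenance
  hosts: databases
  become: yes
  tasks:
    - name: Run maintenance script
      command: /usr/local/bin/db-maintenance.sh   # hypothetical script

- name: Start application tier and validate
  hosts: app_servers
  become: yes
  tasks:
    - name: Start application service
      service:
        name: myapp
        state: started

    - name: Wait for the application port to accept connections
      wait_for:
        port: 8080          # hypothetical application port
        timeout: 60
```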

Role-Based Responsibilities

Tier 1 Responsibilities

Escalation Triggers

Escalate to Tier 2 when:

Tier 2/3 Responsibilities

Common Mistakes and Prevention

Mistake: Running Untested Playbooks in Production

Prevention: Always test playbooks in development environments and use --check mode before production execution.

Mistake: Insufficient Error Handling

Prevention: Implement proper rescue blocks and failure conditions for critical tasks that could impact system availability.

Mistake: Hardcoded Values in Playbooks

Prevention: Use variables and templates to make playbooks reusable across different environments and configurations.

Validation and Verification

Pre-Execution Validation

  1. Verify target host connectivity and access
  2. Confirm required variables are defined
  3. Check playbook syntax and structure
  4. Validate inventory and host group assignments
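
The variable check in step 2 can also be enforced inside the playbook. The sketch below assumes a required variable named app_version, as in the deployment example; adapt the conditions to your own variables.

```yaml
---
- name: Pre-execution validation
  hosts: all
  gather_facts: no
  tasks:
    - name: Confirm required variables are defined
      assert:
        that:
          - app_version is defined
          - app_version | length > 0
        fail_msg: "app_version must be set before running this playbook"
        success_msg: "Required variables are present"
```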

Post-Execution Verification

  1. Review task execution results and changed status
  2. Verify services are running as expected
  3. Test application functionality where applicable
  4. Check system logs for errors or warnings
  5. Confirm configuration changes are properly applied
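
Parts of this checklist can be automated. The sketch below assumes systemd-managed hosts where Apache runs as httpd, as in the web server example, and checks the service state and the listening port.

```yaml
---
- name: Post-execution verification
  hosts: webservers
  become: yes
  tasks:
    - name: Collect service state
      service_facts:

    - name: Confirm Apache is running
      assert:
        that:
          - ansible_facts.services['httpd.service'].state == 'running'
        fail_msg: "httpd is not running on {{ inventory_hostname }}"

    - name: Confirm port 80 accepts connections
      wait_for:
        port: 80
        timeout: 10
```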

Validation & Testing Procedures

Pre-Execution Validation

Objective: Verify playbook syntax, connectivity, and prerequisites before executing automation tasks in production environments.

Prerequisites:

Syntax Validation Steps:

  1. Navigate to playbook directory
  2. Execute syntax check: ansible-playbook --syntax-check playbook.yml
  3. Review output for syntax errors
  4. Correct any YAML formatting issues
  5. Validate inventory file: ansible-inventory --list
  6. Confirm expected hosts appear in output

Connectivity Testing:

  1. Test basic connectivity: ansible all -m ping
  2. Verify specific host groups: ansible webservers -m ping
  3. Check privilege escalation: ansible all -m command -a "whoami" --become (each host should return root)
  4. Document any unreachable hosts

Expected Results:

Dry Run Testing

Objective: Execute playbooks in check mode to preview changes without modifying target systems.

Check Mode Execution:

  1. Run playbook with check flag: ansible-playbook --check playbook.yml
  2. Add diff output for detailed changes: ansible-playbook --check --diff playbook.yml
  3. Review proposed modifications carefully
  4. Verify changes align with intended outcomes
  5. Document any unexpected results

Limited Scope Testing:

  1. Test against single host: ansible-playbook --limit hostname playbook.yml --check
  2. Test specific host group: ansible-playbook --limit webservers playbook.yml --check
  3. Run only tasks carrying a given tag: ansible-playbook --tags specific_tag playbook.yml --check
  4. Validate task dependencies and order

What would you do? A dry run shows files being deleted that should remain.

Answer: Stop execution, review playbook logic, check conditionals and file paths, validate against requirements before proceeding.

Development Environment Testing

Objective: Execute full playbook runs in non-production environments that mirror production configurations.

Test Environment Validation:

  1. Confirm development inventory matches production structure
  2. Verify similar OS versions and configurations
  3. Execute complete playbook: ansible-playbook -i dev_inventory playbook.yml
  4. Monitor execution for errors or warnings
  5. Validate all tasks complete successfully
  6. Test playbook idempotency by running twice

Service Validation Steps:

  1. Check service status: ansible all -m service -a "name=httpd state=started" --check (a "changed" result means the service is not currently running)
  2. Verify port connectivity: ansible all -m wait_for -a "port=80 timeout=10"
  3. Test application functionality manually
  4. Review system logs for errors
  5. Confirm configuration files contain expected values

Production Validation

Objective: Safely validate automation results in production environments with minimal risk.

Phased Deployment Testing:

  1. Select small subset of production hosts
  2. Execute with verbose output: ansible-playbook --limit "webservers[0:2]" -v playbook.yml
  3. Monitor system performance during execution
  4. Validate services remain operational
  5. Check application logs for errors
  6. Proceed to next phase only after validation

Post-Execution Validation:

  1. Verify all expected changes applied
  2. Test critical application functions
  3. Monitor system metrics for anomalies
  4. Confirm backup procedures completed if applicable
  5. Document any deviations from expected results

Common Validation Mistakes

Insufficient Testing Scope:

Environment Mismatches:

Escalation Triggers

Tier 1 Capabilities:

Escalate to Tier 2 When:

Escalate to Tier 3 When:

Troubleshooting Guide (decision-tree oriented)

Initial Problem Assessment

Objective: Systematically identify and resolve Ansible automation issues using a structured decision-tree approach.

Prerequisites: Access to Ansible control node, playbook files, and target system logs. Basic understanding of Ansible concepts covered in earlier sections.

Primary Decision Tree

Start Here: What type of failure are you experiencing?

Command Execution Issues

Symptom: Ansible commands fail to start or produce "command not found" errors.

Decision Path:

  1. Is Ansible installed?
    • Run: ansible --version
    • If command not found → Install Ansible, escalate to Tier 2 for installation approval
    • If version displays → Continue to step 2
  2. Is the inventory file accessible?
    • Check: ls -la /path/to/inventory
    • If file missing → Locate correct inventory path or recreate
    • If permission denied → Fix file permissions or escalate to Tier 2
  3. Are you in the correct working directory?
    • Verify playbook and configuration files are present
    • Check ansible.cfg location and settings

Tier 1 Actions: Verify file paths, check basic permissions, validate command syntax

Escalate to Tier 2: Installation issues, complex permission problems, environment configuration

Connectivity Problems

Symptom: "UNREACHABLE" errors, SSH failures, or authentication timeouts.

Decision Path:

  1. Can you ping the target host?
    • Run: ping target_hostname
    • If no response → Check network connectivity, verify hostname/IP
    • If ping succeeds → Continue to step 2
  2. Can you SSH manually to the target?
    • Test: ssh username@target_hostname
    • If SSH fails → Check SSH service, firewall rules, escalate to Tier 2
    • If SSH succeeds → Continue to step 3
  3. Are Ansible connection parameters correct?
    • Verify inventory file has correct hostnames, usernames, SSH keys
    • Check ansible_host, ansible_user, ansible_ssh_private_key_file variables
    • Test with: ansible target_host -m ping

Common Resolution Steps:

Tier 1 Actions: Basic connectivity tests, inventory verification, SSH key validation

Escalate to Tier 2: Network configuration, firewall rules, SSH service configuration, privilege escalation setup

Syntax and Structure Issues

Symptom: YAML parsing errors, "syntax error" messages, playbook won't start.

Decision Path:

  1. Is the YAML syntax valid?
    • Run: ansible-playbook --syntax-check playbook.yml
    • If syntax errors → Fix indentation, quotes, colons as indicated
    • If syntax check passes → Continue to step 2
  2. Are all required parameters present?
    • Verify each task has name and module
    • Check that playbook has hosts and tasks sections
    • Validate variable names and references
  3. Are module parameters correct?
    • Check module documentation: ansible-doc module_name
    • Verify required parameters are provided
    • Check parameter spelling and format

What would you do? You encounter this error: "ERROR! 'when' is not a valid attribute for a Play"

Answer: Check indentation. 'when' is a task-level keyword, so this error means it has been parsed at play level, aligned with 'hosts' and 'tasks'. Indent it so that it sits inside the task it belongs to, at the same level as that task's 'name'.
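
To make the indentation levels concrete, the sketch below marks which keywords belong to the play and which belong to a task. The package name is an example only.

```yaml
---
- name: Keyword placement example
  hosts: all            # play-level keyword
  become: yes           # accepted at play level and at task level
  tasks:                # play-level keyword; tasks are nested beneath it
    - name: Install a package
      package:          # module name sits at task level
        name: git       # module parameters are indented under the module
        state: present
      when: ansible_os_family == "RedHat"   # task-level keyword, aligned with "name"
```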

Tier 1 Actions: Syntax validation, basic YAML fixes, parameter verification

Escalate to Tier 2: Complex playbook restructuring, custom module issues, advanced templating problems

Task Execution Failures

Symptom: Playbook starts but individual tasks fail with "FAILED" status.

Decision Path:

  1. What is the specific error message?
    • Read the failure output carefully
    • Look for "msg:" field in the error details
    • Note the failing module and parameters
  2. Is it a permissions issue?
    • Check if error mentions "Permission denied" or "Operation not permitted"
    • Verify become/sudo configuration if elevated privileges needed
    • Test with: ansible-playbook -b playbook.yml (if appropriate)
  3. Is it a missing dependency?
    • Check if error mentions missing packages, files, or services
    • Verify target system has required software installed
    • Add dependency installation tasks if needed
  4. Is it a variable or template issue?
    • Look for "undefined variable" errors
    • Check variable definitions in inventory, group_vars, or host_vars
    • Verify Jinja2 template syntax

Validation Steps:

Tier 1 Actions: Read error messages, check basic permissions, verify simple variables

Escalate to Tier 2: Complex permission issues, system configuration problems, advanced templating, custom facts

Performance Problems

Symptom: Playbooks run slowly, timeout errors, or hang indefinitely.

Decision Path:

  1. Is the issue with connection speed?
    • Test network latency to target hosts
    • Check if SSH multiplexing is enabled
    • Consider increasing timeout values
  2. Are you running too many parallel operations?
    • Check forks setting in ansible.cfg
    • Adjust parallelism: ansible-playbook --forks=5 playbook.yml (5 is the default; lower it if the control node is overloaded, raise it only when resources allow)
    • Monitor system resources on control node
  3. Are individual tasks taking too long?
    • Identify slow tasks using verbose output
    • Check for inefficient loops or large file operations
    • Consider breaking large tasks into smaller ones
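
Two of these adjustments can be made in the playbook itself. The sketch below skips fact gathering and runs a long task asynchronously; the script path and the timing values are hypothetical.

```yaml
---
- name: Long-running maintenance
  hosts: all
  become: yes
  gather_facts: no        # skip fact gathering when no task uses facts
  tasks:
    - name: Run long job without holding the SSH connection open
      command: /usr/local/bin/long-job.sh    # hypothetical script
      async: 1800         # allow up to 30 minutes
      poll: 30            # check job status every 30 seconds
```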

Tier 1 Actions: Basic performance monitoring, adjust simple settings like forks

Escalate to Tier 2: Network optimization, system resource issues, complex performance tuning

Escalation Triggers

Immediately escalate when encountering:

Expected Result: Issue identified and either resolved at Tier 1 level or properly escalated with complete diagnostic information.

Prerequisites & Dependencies

System Requirements

Before installing Ansible, verify your environment meets these minimum requirements:

Control Node Dependencies

Install these packages before Ansible installation:

# Ubuntu/Debian
sudo apt update
sudo apt install python3 python3-pip openssh-client

# RHEL/CentOS/Fedora
sudo dnf install python3 python3-pip openssh-clients

# macOS
brew install python3

Managed Node Requirements

Target systems must have:

SSH Key Authentication Setup

Configure passwordless SSH access for automation:

# Generate SSH key pair on control node
ssh-keygen -t rsa -b 4096 -C "ansible-automation"

# Copy public key to managed nodes
ssh-copy-id username@target-host

# Test connectivity
ssh username@target-host "echo 'SSH connection successful'"

Python Package Dependencies

Install required Python libraries:

# Essential package: pip installs the libraries ansible-core requires
# (Jinja2 for templating, PyYAML for YAML parsing, cryptography for Vault) automatically
pip3 install --user ansible-core

# Optional: only needed for the paramiko connection plugin
# (OpenSSH is the default SSH transport)
pip3 install --user paramiko

Network and Firewall Configuration

Ensure network connectivity:

Privilege Escalation Setup

Configure sudo access for automation tasks:

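# Add user to sudoers with NOPASSWD (on managed nodes)
echo "ansible-user ALL=(ALL) NOPASSWD:ALL" | sudo tee /etc/sudoers.d/ansible-user
sudo chmod 0440 /etc/sudoers.d/ansible-user

# Validate the file syntax, then verify sudo access for the automation account
sudo visudo -cf /etc/sudoers.d/ansible-user
sudo -l -U ansible-user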
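
Once some form of privilege escalation already works (for example with --ask-become-pass), the same sudoers entry can be managed by Ansible. This sketch uses the copy module's validate option so that a malformed file is never installed; the visudo path is an assumption and may differ by distribution.

```yaml
---
- name: Configure sudo for the automation account
  hosts: all
  become: yes
  tasks:
    - name: Install sudoers drop-in, validated before it is put in place
      copy:
        content: "ansible-user ALL=(ALL) NOPASSWD:ALL\n"
        dest: /etc/sudoers.d/ansible-user
        owner: root
        group: root
        mode: '0440'
        validate: /usr/sbin/visudo -cf %s   # assumed path to visudo
```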

Validation Checklist

Verify prerequisites before proceeding:

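  1. Python Version: python3 --version shows 3.8+ (recent ansible-core releases require a newer Python on the control node; check the requirements for the release you install)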
  2. SSH Connectivity: Passwordless SSH works to all targets
  3. Sudo Access: User can execute sudo commands without password
  4. Network Latency: Acceptable response times to managed nodes
  5. Disk Space: Sufficient space for playbooks and logs

Common Dependency Issues

Watch for these frequent problems:

Role-Based Prerequisites

Tier 1 Responsibilities:

Escalate to Tier 2 when:

Installation / Deployment / Setup

Installation Objective

Install and configure Ansible on control nodes to manage infrastructure automation. This section covers installation methods, initial configuration, and deployment verification.

Prerequisites

Installation Methods

Package Manager Installation (Recommended for Tier 1)

Install using distribution package managers for stable, supported versions.

Red Hat/CentOS/Fedora:

sudo dnf install ansible-core
# or for older systems
sudo yum install ansible

Ubuntu/Debian:

sudo apt update
sudo apt install ansible

Python pip Installation

Install latest version using Python package manager. Requires Tier 2 approval for production systems.

pip3 install ansible
# or for user-specific installation
pip3 install --user ansible

Initial Configuration

Ansible Configuration File

Create or modify ansible.cfg in project directory or /etc/ansible/ansible.cfg:

[defaults]
inventory = ./inventory
# Lab convenience only: this disables SSH host key verification; set to True in production
host_key_checking = False
remote_user = ansible
private_key_file = ~/.ssh/ansible_key
timeout = 30

[privilege_escalation]
become = True
become_method = sudo
become_user = root

Inventory Setup

Create inventory file listing managed nodes:

[webservers]
web1.example.com
web2.example.com

[databases]
db1.example.com ansible_host=192.168.1.100
db2.example.com ansible_host=192.168.1.101

[production:children]
webservers
databases
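
The same inventory can also be written in YAML, which some teams prefer for nesting groups and variables. This is an equivalent sketch of the INI file above; the YAML inventory plugin expects a .yml or .yaml file extension.

```yaml
---
all:
  children:
    production:
      children:
        webservers:
          hosts:
            web1.example.com:
            web2.example.com:
        databases:
          hosts:
            db1.example.com:
              ansible_host: 192.168.1.100
            db2.example.com:
              ansible_host: 192.168.1.101
```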

SSH Key Configuration

Generate SSH Key Pair

ssh-keygen -t rsa -b 4096 -f ~/.ssh/ansible_key
# Leave the passphrase empty only for unattended automation, and restrict access to the private key file

Distribute Public Key

ssh-copy-id -i ~/.ssh/ansible_key.pub user@target-host
# Repeat for all managed nodes
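
For many hosts, key distribution can itself be automated. The sketch below assumes the remote account is named ansible (as in the ansible.cfg example), that the ansible.posix collection providing authorized_key is installed, and that the first run authenticates with a password, for example via --ask-pass.

```yaml
---
- name: Distribute the automation public key
  hosts: all
  tasks:
    - name: Authorize the control node key for the remote user
      authorized_key:
        user: ansible        # assumed remote account name
        state: present
        # Read the public key from the control node user's home directory
        key: "{{ lookup('file', lookup('env', 'HOME') ~ '/.ssh/ansible_key.pub') }}"
```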

Installation Validation

Version Verification

ansible --version
ansible-playbook --version

Expected Result: Version information displays without errors, showing ansible-core version and Python version.

Connectivity Test

ansible all -m ping
ansible all -m setup --limit 'all[0]'

Expected Result: All hosts return "pong" response and system facts display for test host.

Privilege Escalation Test

ansible all -m command -a "whoami" --become

Expected Result: Returns "root" for all managed nodes.

Common Installation Issues

Python Version Conflicts

Symptom: ImportError or module not found errors

Tier 1 Action: Verify Python version with python3 --version. If below 3.8, escalate to Tier 2.

Resolution: Update Python or use virtual environment with correct version.

SSH Connection Failures

Symptom: "UNREACHABLE" errors during ping test

Tier 1 Troubleshooting:

Permission Denied Errors

Symptom: "FAILED" status with permission errors

Tier 1 Actions:

Deployment Scenarios

Scenario: New Control Node Setup

Situation: Setting up Ansible on fresh Linux server for team use.

What would you do?

  1. Install Ansible using package manager
  2. Create dedicated ansible user account
  3. Generate SSH keys for ansible user
  4. Configure ansible.cfg with team standards
  5. Test connectivity to existing managed nodes

Common Mistake: Using root user for Ansible operations. Always use dedicated service account with appropriate sudo privileges.

Scenario: Multi-Environment Setup

Situation: Separate inventories needed for development, staging, and production.

Tier 1 Approach: Create separate inventory files (dev-inventory, staging-inventory, prod-inventory) and specify using -i flag.

Tier 2 Requirement: Production environment access requires approval and separate SSH keys.

Escalation Triggers

Post-Installation Security Checklist

Operational Procedures (daily/weekly/monthly)

Daily Operations

Morning Health Check

Objective: Verify Ansible infrastructure is operational and ready for daily automation tasks.

Prerequisites: Access to Ansible control nodes and monitoring dashboards.

  1. Check Ansible control node system resources (CPU, memory, disk space)
  2. Verify SSH connectivity to managed nodes using ansible ping module
  3. Review overnight playbook execution logs for failures
  4. Validate inventory synchronization from external sources
  5. Check credential vault accessibility

Expected Result: All systems responsive with no critical errors identified.

Validation: Run ansible all -m ping successfully against sample inventory groups.

Escalation: If more than 10% of managed nodes are unreachable or control node resources exceed 80% utilization.
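
Parts of the health check can be scripted. The sketch below pings each host and fails where the root filesystem is more than 80% full; to cover the control node as well, add localhost to the targeted hosts. The threshold mirrors the escalation rule above, and the rest is illustrative.

```yaml
---
- name: Morning health check
  hosts: all
  gather_facts: yes
  tasks:
    - name: Confirm the host responds to Ansible
      ping:

    - name: Flag a root filesystem above 80% usage
      assert:
        that:
          - (item.size_total - item.size_available) / item.size_total < 0.8
        fail_msg: "Root filesystem on {{ inventory_hostname }} is above 80% usage"
      loop: "{{ ansible_mounts | selectattr('mount', 'equalto', '/') | list }}"
      loop_control:
        label: "{{ item.mount }}"
```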

Playbook Execution Review

What would you do if a critical daily playbook failed overnight?

Tier 1 Actions: Basic log review, system connectivity checks, standard playbook re-execution.

Escalation Required: Playbook modification, credential issues, infrastructure problems affecting multiple systems.

Weekly Operations

Inventory Audit and Cleanup

Objective: Maintain accurate inventory and remove obsolete entries.

  1. Compare dynamic inventory against actual infrastructure
  2. Identify unreachable hosts that have been offline for more than 7 days
  3. Verify group memberships align with current system roles
  4. Update host variables for systems with configuration changes
  5. Remove decommissioned systems from static inventory files

Common Mistake: Removing hosts that are temporarily offline for maintenance. Always verify decommission status before deletion.

Playbook Performance Analysis

Objective: Identify performance bottlenecks and optimization opportunities.

  1. Review execution time reports for all playbooks run in past week
  2. Identify playbooks with increasing execution times
  3. Analyze task-level timing for slow playbooks
  4. Document performance trends and recommend optimizations

Tier 2 Responsibility: Performance analysis and optimization recommendations require deeper Ansible expertise.

Security Review

  1. Audit vault file access logs
  2. Review SSH key usage and rotation schedules
  3. Verify privilege escalation is properly configured
  4. Check for hardcoded credentials in playbooks (should find none)

Monthly Operations

Comprehensive System Maintenance

Objective: Perform thorough maintenance to ensure long-term system reliability.

  1. Update Ansible core and collections to latest stable versions
  2. Review and rotate service account credentials
  3. Analyze disk usage trends on control nodes
  4. Archive old execution logs and reports
  5. Test disaster recovery procedures
  6. Review and update documentation

Escalation Trigger: Any maintenance activity that could impact production automation requires Tier 2/3 approval.

Capacity Planning Review

Scenario: You notice Ansible job queue times increasing during peak hours.

Analysis Steps:

Compliance and Audit Preparation

  1. Generate execution reports for all automated changes
  2. Verify change tracking and approval workflows
  3. Review access control configurations
  4. Prepare documentation for compliance requirements
  5. Test audit trail completeness

Training and Knowledge Transfer

Monthly Requirements:

What would you do if a new team member needs Ansible access?

Correct Answer: Follow established access provisioning procedures, ensure proper training completion, and verify role-appropriate permissions. Never grant administrative access as a starting point.

Emergency Procedures

Incident Response

Tier 1 Immediate Actions:

Escalation Required: Infrastructure-wide automation failures, security incidents, or any situation requiring playbook modifications during incident response.

Monitoring, Metrics & Alerting

Objective

Monitor Ansible automation health, track performance metrics, and configure alerting to ensure reliable automation operations and proactive issue detection.

Prerequisites

Key Metrics to Monitor

Playbook Execution Metrics

System Resource Metrics

Automation Controller Metrics (if applicable)

Monitoring Implementation

Ansible Callback Plugins

Configure callback plugins to export metrics to monitoring systems:

# ansible.cfg
[defaults]
callback_plugins = /path/to/callback/plugins
callbacks_enabled = timer, profile_tasks, prometheus

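# Note: "prometheus" is not a built-in Ansible callback. This section assumes a
# custom plugin of that name is installed in the callback_plugins path above.
[callback_prometheus]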
prometheus_gateway = http://pushgateway:9091
job_name = ansible_playbooks

Custom Metrics Collection

Implement custom tasks within playbooks to report application-specific metrics:

- name: Report deployment metrics
  uri:
    url: "http://monitoring-api/metrics"
    method: POST
    body_format: json
    body:
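      deployment_time: "{{ ansible_date_time.epoch }}"  # requires gathered facts
      hosts_updated: "{{ ansible_play_hosts | length }}"
      playbook_name: "{{ ansible_play_name }}"  # name of the current play; Ansible has no magic variable for the playbook file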
  delegate_to: localhost
  run_once: true

Log Monitoring and Analysis

Centralized Log Collection

Configure log forwarding to centralized systems:

# rsyslog configuration for Ansible logs
$template AnsibleLogFormat,"%timestamp% %hostname% ansible: %msg%\n"
if $programname == 'ansible' then /var/log/ansible/ansible.log;AnsibleLogFormat
& stop

Log Analysis Patterns

Alerting Configuration

Critical Alerts

Warning Alerts

Sample Prometheus Alert Rules

groups:
- name: ansible.rules
  rules:
  - alert: AnsiblePlaybookFailureRate
    expr: rate(ansible_playbook_failures_total[5m]) > 0.1
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "High Ansible playbook failure rate"
      
  - alert: AnsibleControlNodeDown
    expr: up{job="ansible-control"} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Ansible control node is unreachable"
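When tuning the failure-rate threshold, keep in mind that the PromQL `rate()` function returns a per-second average over the window. The short calculation below converts the rule's threshold into an approximate failure count; the helper name is hypothetical and the arithmetic is the only point being illustrated.

```python
def events_for_rate(rate_per_second, window_seconds):
    """Convert a per-second rate threshold into an event count over a window."""
    return rate_per_second * window_seconds


# Threshold and window taken from the AnsiblePlaybookFailureRate rule above.
threshold_count = events_for_rate(0.1, 5 * 60)
print(f"Alert condition is met above roughly {threshold_count:.0f} failures in 5 minutes")
```

If far fewer failures should raise a warning, the `0.1` in the expression needs to be lowered accordingly.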

Monitoring Dashboards

Operational Dashboard Elements

Performance Dashboard Elements

Troubleshooting Scenarios

Scenario: Playbook Performance Degradation

What would you do if playbook execution times suddenly increased by 50%?

  1. Check control node resource utilization metrics
  2. Analyze task-level timing data from callback plugins
  3. Verify network connectivity and latency to target hosts
  4. Review recent changes to playbooks or inventory
  5. Check whether a higher forks setting or overlapping jobs are contending for control node resources

Correct approach: Start with infrastructure metrics, then drill down to application-level timing data to isolate the bottleneck.

Scenario: Intermittent Connection Failures

What would you do if you saw sporadic SSH connection failures across multiple hosts?

  1. Monitor the number of concurrent SSH connections against server-side limits (for example, the sshd MaxStartups setting)
  2. Check network infrastructure between control and managed nodes
  3. Analyze SSH daemon logs on target hosts
  4. Review Ansible fork and timeout configurations
  5. Validate SSH key authentication status

Role-Based Responsibilities

Tier 1 Support

Tier 2/3 Support

Common Monitoring Mistakes

Validation Steps

  1. Verify metrics are being collected and stored correctly
  2. Test alert notifications through all configured channels
  3. Confirm dashboard data accuracy against known playbook runs
  4. Validate log parsing and analysis rules
  5. Test escalation procedures with simulated incidents

Escalation Triggers

Compliance, Logging & Audit Requirements

Audit Trail Objectives

Ansible automation must maintain comprehensive audit trails to demonstrate compliance with organizational policies, regulatory requirements, and security standards. All automation activities require detailed logging for accountability, forensic analysis, and compliance reporting.

Required Logging Components

Ansible Controller Audit Logging

Playbook Execution Logging

Compliance Configuration Requirements

Log Retention Policies

Forward job logs to a centralized system, where retention can be enforced according to compliance requirements:

AWX_TASK_ENV['ANSIBLE_LOG_PATH'] = '/var/log/ansible/ansible.log'
LOG_AGGREGATOR_ENABLED = True
LOG_AGGREGATOR_HOST = 'siem.company.com'
LOG_AGGREGATOR_PORT = 514
LOG_AGGREGATOR_TYPE = 'other'  # generic receiver; named types include logstash and splunk
LOG_AGGREGATOR_PROTOCOL = 'tcp'

Audit Database Configuration

# Enable detailed activity stream
ACTIVITY_STREAM_ENABLED = True
ACTIVITY_STREAM_ENABLED_FOR_INVENTORY_SYNC = True

# Configure audit log forwarding
LOGGING = {
    'version': 1,
    'handlers': {
        'audit_file': {
            'class': 'logging.handlers.RotatingFileHandler',
            'filename': '/var/log/tower/audit.log',
            'maxBytes': 1024*1024*100,
            'backupCount': 10,
        }
    }
}

Regulatory Compliance Scenarios

SOX Compliance Example

Scenario: Financial system configuration changes require documented approval and audit trail.

What would you do? A playbook needs to modify database configurations on production financial systems.

Correct approach:

  1. Implement approval workflow in job template
  2. Configure detailed logging with change tracking
  3. Require dual authorization for execution
  4. Generate compliance reports from audit logs

HIPAA Compliance Example

Scenario: Healthcare data processing systems require access logging and data handling audit trails.

Required controls:

Audit Log Analysis Procedures

Daily Audit Review Process

Objective: Identify compliance violations and security anomalies in Ansible automation activities.

Prerequisites: Access to centralized logging system and audit analysis tools.

Steps:

  1. Review failed job executions for unauthorized access attempts
  2. Analyze privilege escalation patterns for policy violations
  3. Verify credential usage aligns with authorized personnel
  4. Check inventory modifications against change management records
  5. Validate template executions match approved automation workflows
  6. Document any anomalies requiring investigation

Expected result: Daily compliance status report with identified violations and remediation actions.
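Steps 4 and 5 of the review amount to cross-checking job records against approved templates and change tickets. The sketch below shows that logic on in-memory data. The `JobRecord` fields, the ticket identifiers, and the idea that job records have already been exported from the controller are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JobRecord:
    job_id: int
    template: str
    launched_by: str
    change_ticket: Optional[str]  # None when no ticket was linked


def find_violations(jobs, approved_templates, approved_tickets):
    """Return (job_id, reason) pairs for jobs that need investigation."""
    violations = []
    for job in jobs:
        if job.template not in approved_templates:
            violations.append((job.job_id, "template not in approved list"))
        if job.change_ticket is None:
            violations.append((job.job_id, "no change ticket linked"))
        elif job.change_ticket not in approved_tickets:
            violations.append((job.job_id, "change ticket not approved"))
    return violations


jobs = [
    JobRecord(101, "patch-web", "alice", "CHG-1001"),
    JobRecord(102, "patch-db", "bob", None),
    JobRecord(103, "adhoc-fix", "carol", "CHG-9999"),
]
violations = find_violations(
    jobs,
    approved_templates={"patch-web", "patch-db"},
    approved_tickets={"CHG-1001"},
)
for job_id, reason in violations:
    print(job_id, reason)
```

Each reported pair becomes an entry in the daily compliance status report for follow-up.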

Compliance Report Generation

Validation steps:

Common Compliance Violations

Insufficient Logging Detail

Mistake: Running playbooks without verbose logging enabled for compliance-sensitive operations.

Prevention: Configure job templates with mandatory verbose logging for regulated systems. Use callback plugins to ensure comprehensive audit trails.

Missing Change Attribution

Mistake: Automated changes without clear business justification or approval documentation.

Prevention: Implement workflow approvals with business justification requirements. Link automation jobs to change management tickets.

Role-Based Compliance Responsibilities

Tier 1 Responsibilities

Escalation to Tier 2/3

Escalate when:

Audit Evidence Preservation

Legal Hold Procedures

When legal or regulatory investigations require audit evidence preservation:

  1. Immediately suspend log rotation and deletion policies
  2. Create forensic copies of relevant audit databases
  3. Document chain of custody for all preserved evidence
  4. Coordinate with legal and compliance teams for evidence handling

Escalation trigger: Any request for audit evidence preservation must be escalated to Tier 3 and management within 2 hours of notification.

Backup, Restore & Disaster Recovery

Backup Strategy Overview

Ansible environments require comprehensive backup strategies covering playbooks, inventory files, configuration data, and execution history. This section focuses on operational backup and recovery procedures for maintaining business continuity.

Critical Components to Back Up

Daily Backup Procedures

Objective

Perform automated daily backups of all critical Ansible components to ensure recovery capability within defined RTO/RPO targets.

Prerequisites

Backup Execution Steps

  1. Verify backup storage accessibility and available space
  2. Create timestamped backup directory structure
  3. Execute configuration files backup using rsync or tar
  4. Dump AWX/Tower database if applicable
  5. Compress and encrypt backup archives
  6. Transfer backups to offsite storage location
  7. Verify backup integrity and completeness
  8. Update backup inventory and retention records
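Steps 2, 3, 5 (compression only), and 7 can be sketched with the Python standard library. This is a simplified illustration under assumptions: the archive name `ansible-config.tar.gz` and the function names are hypothetical, and encryption, database dumps, and offsite transfer are deliberately left out because they depend on site-specific tooling.

```python
import hashlib
import tarfile
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path):
    """Stream the file through SHA-256 so large archives do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def create_backup(source_dir, backup_root):
    """Archive source_dir into a timestamped directory; return (path, sha256)."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target_dir = Path(backup_root) / stamp
    target_dir.mkdir(parents=True, exist_ok=True)
    archive = target_dir / "ansible-config.tar.gz"  # hypothetical archive name
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(source_dir, arcname=Path(source_dir).name)
    return archive, sha256_of(archive)


def verify_backup(archive, expected_sha256):
    """Return True when the checksum matches and the archive can be listed."""
    if sha256_of(archive) != expected_sha256:
        return False
    with tarfile.open(archive, "r:gz") as tar:
        return len(tar.getnames()) > 0
```

Recording the checksum alongside the archive at creation time allows the same `verify_backup` check to be repeated after the transfer to offsite storage.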

Expected Result

Complete backup archive containing all critical components, successfully transferred to secure storage with verified integrity.

Validation Steps

Restore Procedures

Emergency Restore Scenario

When primary Ansible infrastructure fails, follow these restoration steps to minimize downtime and restore operational capability.

Restore Prerequisites

System Restore Steps

  1. Install base Ansible packages on replacement system
  2. Create required user accounts and directory structures
  3. Extract configuration files from backup archives
  4. Restore playbooks, roles, and inventory files
  5. Decrypt and restore vault files and SSH keys
  6. Configure network settings and firewall rules
  7. Restore AWX/Tower database and configuration
  8. Start Ansible services and verify functionality
  9. Test connectivity to managed nodes
  10. Execute validation playbooks to confirm operation

Restore Validation

Disaster Recovery Planning

Recovery Time Objectives (RTO)

Recovery Point Objectives (RPO)

Training Scenario: Control Node Failure

Situation: Primary Ansible control node experiences hardware failure during business hours. Critical automation jobs are scheduled to run within 2 hours.

What would you do?

  1. Immediately assess scope of failure and impact
  2. Activate disaster recovery procedures
  3. Provision replacement infrastructure
  4. Begin restore process using latest backup
  5. Communicate status to stakeholders

Correct Response: Follow established disaster recovery runbook, prioritizing restoration of critical automation workflows first. Communicate regularly with stakeholders about recovery progress and expected completion time.

Role-Based Responsibilities

Tier 1 Responsibilities

Tier 2/3 Responsibilities

Common Mistakes and Prevention

Escalation Triggers

Integration & Interoperability

Section Objective

Learn how to integrate Ansible with external systems, APIs, and third-party tools to create comprehensive automation workflows that span multiple platforms and technologies.

Prerequisites

REST API Integration

Using the uri Module

The uri module enables Ansible to interact with REST APIs for system integration:

- name: Create user via API
  uri:
    url: "https://api.example.com/users"
    method: POST
    headers:
      Authorization: "Bearer {{ api_token }}"
      Content-Type: "application/json"
    body_format: json
    body:
      username: "{{ new_user }}"
      email: "{{ user_email }}"
    status_code: [201, 409]
  register: api_response

- name: Handle API response
  debug:
    msg: "User created with ID: {{ api_response.json.id }}"
  when: api_response.status == 201

Authentication Methods

Common API authentication patterns in Ansible:

# Token-based authentication
- name: Get API token
  uri:
    url: "https://api.example.com/auth/token"
    method: POST
    body_format: json
    body:
      username: "{{ vault_api_user }}"
      password: "{{ vault_api_pass }}"
  register: token_response
  no_log: true  # keep the credentials and the returned token out of job output

- name: Use token for subsequent calls
  uri:
    url: "https://api.example.com/data"
    headers:
      Authorization: "Bearer {{ token_response.json.access_token }}"

Database Integration

MySQL Integration

- name: Query application database
  mysql_query:
    login_host: "{{ db_host }}"
    login_user: "{{ db_user }}"
    login_password: "{{ db_password }}"
    login_db: "{{ app_database }}"
    query: "SELECT status FROM services WHERE name = %s"
    positional_args:
      - "{{ service_name }}"
  register: service_status

- name: Proceed based on database state
  include_tasks: deploy_service.yml
  when: service_status.query_result[0][0].status == 'ready'  # each row is returned as a dictionary keyed by column name

PostgreSQL Integration

- name: Update configuration table
  postgresql_query:
    db: "{{ postgres_db }}"
    login_host: "{{ postgres_host }}"
    login_user: "{{ postgres_user }}"
    login_password: "{{ postgres_password }}"
    query: |
      UPDATE config_settings 
      SET value = %s, updated_at = NOW() 
      WHERE key = %s
    positional_args:
      - "{{ new_config_value }}"
      - "{{ config_key }}"

Cloud Platform Integration

AWS Integration

- name: Launch EC2 instance and configure
  block:
    - name: Create EC2 instance
      amazon.aws.ec2_instance:
        name: "{{ instance_name }}"
        image_id: "{{ ami_id }}"
        instance_type: "{{ instance_type }}"
        security_group: "{{ security_group }}"
        vpc_subnet_id: "{{ subnet_id }}"
        state: present
      register: ec2_result

    - name: Wait for instance to be ready
      wait_for:
        host: "{{ ec2_result.instances[0].public_ip_address }}"
        port: 22
        timeout: 300

    - name: Add to inventory
      add_host:
        name: "{{ ec2_result.instances[0].public_ip_address }}"
        groups: web_servers

Azure Integration

- name: Create Azure resource group and VM
  block:
    - name: Create resource group
      azure_rm_resourcegroup:
        name: "{{ resource_group }}"
        location: "{{ azure_region }}"

    - name: Create virtual machine
      azure_rm_virtualmachine:
        resource_group: "{{ resource_group }}"
        name: "{{ vm_name }}"
        vm_size: "{{ vm_size }}"
        image: "{{ vm_image }}"  # needed when creating a new VM; vm_image is an assumed variable
        admin_username: "{{ admin_user }}"
        ssh_password_enabled: false
        ssh_public_keys:
          - path: "/home/{{ admin_user }}/.ssh/authorized_keys"
            key_data: "{{ ssh_public_key }}"

Monitoring System Integration

Prometheus Integration

- name: Query Prometheus for system metrics
  uri:
    url: "{{ prometheus_url }}/api/v1/query"
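    method: POST  # the query API accepts form-encoded parameters in a POST body
    body_format: form-urlencoded
    body:
      query: "up{job='{{ service_name }}'} == 1"  # returns only targets that are currently up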
  register: prometheus_response

- name: Check service health
  set_fact:
    service_healthy: "{{ prometheus_response.json.data.result | length > 0 }}"

- name: Restart service if unhealthy
  systemd:
    name: "{{ service_name }}"
    state: restarted
  when: not service_healthy

Grafana Dashboard Management

- name: Create Grafana dashboard
  uri:
    url: "{{ grafana_url }}/api/dashboards/db"
    method: POST
    headers:
      Authorization: "Bearer {{ grafana_api_key }}"
      Content-Type: "application/json"
    body_format: json
    body:
      dashboard: "{{ dashboard_config }}"
      overwrite: true
  register: dashboard_result

Version Control Integration

Git Repository Operations

- name: Clone and deploy from Git
  block:
    - name: Clone repository
      git:
        repo: "{{ git_repo_url }}"
        dest: "{{ deploy_path }}"
        version: "{{ git_branch | default('main') }}"
        force: yes
      register: git_result

    - name: Install dependencies if code changed
      command: "{{ install_command }}"
      args:
        chdir: "{{ deploy_path }}"
      when: git_result.changed

    - name: Restart application
      systemd:
        name: "{{ app_service }}"
        state: restarted
      when: git_result.changed

Container Orchestration Integration

Kubernetes Integration

- name: Deploy to Kubernetes cluster
  kubernetes.core.k8s:
    state: present
    definition:
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: "{{ app_name }}"
        namespace: "{{ k8s_namespace }}"
      spec:
        replicas: "{{ replica_count }}"
        selector:
          matchLabels:
            app: "{{ app_name }}"
        template:
          metadata:
            labels:
              app: "{{ app_name }}"
          spec:
            containers:
            - name: "{{ app_name }}"
              image: "{{ container_image }}"
              ports:
              - containerPort: "{{ app_port }}"

Docker Swarm Integration

- name: Deploy Docker service
  docker_swarm_service:
    name: "{{ service_name }}"
    image: "{{ docker_image }}"
    replicas: "{{ service_replicas }}"
    networks:
      - "{{ docker_network }}"
    env:
      DATABASE_URL: "{{ database_connection }}"
    publish:
      - published_port: "{{ external_port }}"
        target_port: "{{ internal_port }}"

Configuration Management Integration

Consul Integration

- name: Register service in Consul
  uri:
    url: "{{ consul_url }}/v1/agent/service/register"
    method: PUT
    body_format: json
    body:
      ID: "{{ service_id }}"
      Name: "{{ service_name }}"
      Address: "{{ ansible_default_ipv4.address }}"
      Port: "{{ service_port }}"
      Check:
        HTTP: "http://{{ ansible_default_ipv4.address }}:{{ service_port }}/health"
        Interval: "30s"

- name: Retrieve configuration from Consul KV
  uri:
    url: "{{ consul_url }}/v1/kv/{{ config_path }}"
    method: GET
  register: consul_config

Notification System Integration

Slack Integration

- name: Send deployment notification
  uri:
    url: "{{ slack_webhook_url }}"
    method: POST
    body_format: json
    body:
      channel: "{{ slack_channel }}"
      username: "Ansible Bot"
      text: "Deployment of {{ application_name }} to {{ target_environment }} completed successfully"
      attachments:
        - color: "good"
          fields:
            - title: "Version"
              value: "{{ deployment_version }}"
              short: true
            - title: "Environment"
              value: "{{ target_environment }}"
              short: true

Email Integration

- name: Send deployment report via email
  mail:
    to: "{{ deployment_team_email }}"
    subject: "Deployment Report - {{ application_name }}"
    body: |
      Deployment Summary:
      
      Application: {{ application_name }}
      Environment: {{ target_environment }}
      Version: {{ deployment_version }}
      Status: {{ deployment_status }}
      
      Deployed services:
      {% for service in deployed_services %}
      - {{ service.name }}: {{ service.status }}
      {% endfor %}
    host: "{{ smtp_server }}"

Integration Scenarios and Decision Points

Scenario: Multi-System Deployment

Situation: You need to deploy an application that requires database updates, load balancer configuration, and monitoring setup.

What would you do?

  1. Deploy application first, then configure supporting systems
  2. Configure all supporting systems first, then deploy application
  3. Use a coordinated approach with proper ordering and validation

Correct Answer: Option 3 - Use a coordinated approach

Reasoning: Proper integration requires careful orchestration to ensure dependencies are met and systems remain consistent throughout the deployment process.

Scenario: API Integration Failure

Situation: An API call in your playbook returns a 500 error during execution.

What would you do?

  1. Ignore the error and continue with the playbook
  2. Implement retry logic with exponential backoff
  3. Fail immediately and alert the team

Correct Answer: Option 2 - Implement retry logic

Reasoning: Transient API failures are common; retry logic provides resilience while still failing appropriately for persistent issues.
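The retry pattern itself is easiest to see outside a playbook. The sketch below is a generic Python illustration of exponential backoff. `TransientError`, `call_with_backoff`, and the default delays are assumptions for this example, and the injectable `sleep` parameter exists only to make the behavior easy to test. Within playbooks, the `retries`, `delay`, and `until` task keywords provide retry behavior with a fixed delay between attempts.

```python
import time


class TransientError(Exception):
    """Raised by the wrapped function for retryable failures (e.g. HTTP 5xx)."""


def call_with_backoff(func, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying on TransientError with exponentially growing delays.

    The delay before retry n is base_delay * 2**n (1s, 2s, 4s by default).
    The final failure is re-raised so persistent problems still fail the run.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except TransientError:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Usage: a call that fails twice with a simulated HTTP 500, then succeeds.
calls = {"count": 0}
delays = []


def flaky_api_call():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("HTTP 500")
    return "ok"


result = call_with_backoff(flaky_api_call, sleep=delays.append)
print(result, delays)
```

Capping the number of attempts is what keeps this from masking a persistent outage: after the last attempt the error propagates and normal failure handling and alerting apply.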

Common Integration Mistakes

Authentication Token Management

Mistake: Hardcoding API tokens or storing them in plain text

Solution: Always use Ansible Vault for sensitive credentials and implement token refresh logic for long-running operations

Error Handling in Integrations

Mistake: Not handling partial failures in multi-system operations

Solution: Implement comprehensive error handling with rollback capabilities and clear escalation paths

Dependency Management

Mistake: Not validating external system availability before proceeding

Solution: Always include connectivity and health checks before performing integration operations

Role-Based Responsibilities

Tier 1 Responsibilities

Tier 2/3 Responsibilities

Validation Steps

Integration Health Check

- name: Validate integration endpoints
  uri:
    url: "{{ item.health_check_url }}"
    method: GET
    status_code: 200
  loop: "{{ integration_endpoints }}"
  register: health_checks
  failed_when: false  # probe every endpoint so the report below reflects all results

- name: Report integration status
  debug:
    msg: "All integrations healthy: {{ health_checks.results | selectattr('status', 'equalto', 200) | list | length == integration_endpoints | length }}"

Expected Results

After completing integration tasks, you should observe:

Escalation Triggers

Escalate to Tier 2/3 when:

Tools, Scripts & Automation

Ansible Development Tools

Several tools enhance Ansible development and operations workflows. Each serves specific purposes in the automation lifecycle.

Ansible-lint

Static analysis tool that checks playbooks for best practices and potential issues.

# Install ansible-lint
pip install ansible-lint

# Run against playbook
ansible-lint playbook.yml

# Run against role
ansible-lint roles/webserver/

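# Skip specific rules (current releases identify rules by name; numeric IDs such as 301 date from ansible-lint 4.x)
ansible-lint -x no-changed-when,command-instead-of-module playbook.yml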

Common lint rules address:

Ansible-vault Integration Scripts

Custom scripts for managing encrypted content in CI/CD pipelines.

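#!/bin/bash
# vault-deploy.sh
set -euo pipefail

# ansible-playbook reads ANSIBLE_VAULT_PASSWORD_FILE from the environment,
# so the path does not need to be repeated on the command line.
export ANSIBLE_VAULT_PASSWORD_FILE=/secure/vault-pass
ansible-playbook -i inventory/production deploy.yml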

Molecule Testing Framework

Tool for testing Ansible roles across multiple scenarios and platforms.

# Initialize molecule in role directory
molecule init scenario

# Run full test cycle
molecule test

# Create test instance
molecule create

# Run converge only
molecule converge

Custom Automation Scripts

Inventory Management Scripts

Dynamic inventory scripts pull host information from external sources.

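#!/usr/bin/env python3
# aws_inventory.py
import json

import boto3


def get_ec2_inventory():
    ec2 = boto3.client('ec2')
    inventory = {'all': {'hosts': []}, '_meta': {'hostvars': {}}}

    # Paginate so that accounts with many instances are listed completely
    paginator = ec2.get_paginator('describe_instances')
    for page in paginator.paginate():
        for reservation in page['Reservations']:
            for instance in reservation['Instances']:
                if instance['State']['Name'] != 'running':
                    continue
                # Assumption: hosts are addressed by private IP; adjust as needed
                address = instance.get('PrivateIpAddress')
                if not address:
                    continue
                inventory['all']['hosts'].append(address)
                inventory['_meta']['hostvars'][address] = {
                    'instance_id': instance['InstanceId'],
                }

    return inventory


if __name__ == '__main__':
    print(json.dumps(get_ec2_inventory()))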

Deployment Wrapper Scripts

Scripts that standardize deployment processes across environments.

#!/bin/bash
# deploy-wrapper.sh

set -euo pipefail

ENVIRONMENT="${1:-}"
PLAYBOOK="${2:-}"
EXTRA_VARS="${3:-}"

if [[ -z "$ENVIRONMENT" || -z "$PLAYBOOK" ]]; then
    echo "Usage: $0 <environment> <playbook> [extra-vars]" >&2
    exit 1
fi

# Validate environment
case $ENVIRONMENT in
    dev|staging|production)
        echo "Deploying to $ENVIRONMENT"
        ;;
    *)
        echo "Invalid environment: $ENVIRONMENT"
        exit 1
        ;;
esac

# Set environment-specific variables
INVENTORY="inventory/$ENVIRONMENT"
VAULT_FILE="group_vars/$ENVIRONMENT/vault.yml"

# Execute playbook; extra vars are passed only when supplied
ansible-playbook -i "$INVENTORY" "$PLAYBOOK" --vault-password-file ~/.vault_pass ${EXTRA_VARS:+--extra-vars "$EXTRA_VARS"}

CI/CD Integration

Jenkins Pipeline Integration

Jenkinsfile examples for Ansible automation in CI/CD pipelines.

pipeline {
    agent any
    
    stages {
        stage('Lint') {
            steps {
                sh 'ansible-lint playbooks/'
            }
        }
        
        stage('Test') {
            steps {
                sh 'molecule test'
            }
        }
        
        stage('Deploy') {
            when {
                branch 'main'
            }
            steps {
                withCredentials([file(credentialsId: 'vault-password', variable: 'VAULT_PASS')]) {
                    sh 'ansible-playbook -i inventory/production deploy.yml --vault-password-file $VAULT_PASS'
                }
            }
        }
    }
}

GitLab CI Integration

GitLab CI configuration for automated Ansible deployments.

# .gitlab-ci.yml
stages:
  - validate
  - test
  - deploy

variables:
  ANSIBLE_HOST_KEY_CHECKING: "False"  # disables SSH host key verification; prefer a pre-populated known_hosts file for production

validate:
  stage: validate
  script:
    - ansible-lint playbooks/
    - ansible-playbook --syntax-check playbooks/site.yml

test:
  stage: test
  script:
    - molecule test
  only:
    - merge_requests

deploy_staging:
  stage: deploy
  script:
    - ansible-playbook -i inventory/staging deploy.yml
  only:
    - develop

deploy_production:
  stage: deploy
  script:
    - ansible-playbook -i inventory/production deploy.yml
  when: manual
  only:
    - main

Monitoring and Logging Automation

Callback Plugins

Custom callback plugins for enhanced logging and monitoring.

# callback_plugins/custom_logger.py
from ansible.plugins.callback import CallbackBase
import json
import requests

class CallbackModule(CallbackBase):
    CALLBACK_VERSION = 2.0
    CALLBACK_TYPE = 'aggregate'
    CALLBACK_NAME = 'custom_logger'

    def v2_playbook_on_stats(self, stats):
        # Send completion stats to monitoring system
        data = {
            'hosts': list(stats.processed.keys()),
            'ok': stats.ok,
            'failures': stats.failures,
            'unreachable': stats.dark
        }
        
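        # Post to monitoring endpoint; a reporting failure must not break the run
        try:
            requests.post('http://monitoring.example.com/ansible', json=data, timeout=10)
        except requests.RequestException as exc:
            self._display.warning(f"custom_logger: could not send stats: {exc}")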

Log Analysis Scripts

Scripts for parsing and analyzing Ansible execution logs.

#!/usr/bin/env python3
# analyze_logs.py
import re
import sys
from collections import defaultdict

def parse_ansible_log(log_file):
    stats = defaultdict(int)
    failed_tasks = []
    
    with open(log_file, 'r') as f:
        for line in f:
            if 'TASK [' in line:
                stats['tasks'] += 1
            elif 'fatal:' in line:
                stats['failures'] += 1
                failed_tasks.append(line.strip())
            elif 'ok:' in line:
                stats['success'] += 1
    
    return stats, failed_tasks

if __name__ == '__main__':
    if len(sys.argv) != 2:
        sys.exit(f"Usage: {sys.argv[0]} <ansible-log-file>")
    stats, failures = parse_ansible_log(sys.argv[1])
    print(f"Task Statistics: {dict(stats)}")
    if failures:
        print("Failed Tasks:")
        for failure in failures:
            print(f"  {failure}")

Role-Based Tool Usage

Tier 1 Responsibilities

Tier 2/3 Responsibilities

Best Practices for Tool Integration

Version Control Integration

Maintain all automation tools and scripts in version control with proper branching strategies.

Security Considerations

Error Handling

All automation scripts should include comprehensive error handling and logging mechanisms.

#!/bin/bash
# Error handling example
set -euo pipefail

log() {
    echo "[$(date +'%Y-%m-%d %H:%M:%S')] $1" >&2
}

cleanup() {
    log "Cleaning up temporary files"
    rm -f /tmp/ansible-$$.*
}

trap cleanup EXIT

log "Starting automation process"
# Automation logic here

Change Management & Versioning

Change Management Framework for Ansible

Ansible automation changes require structured change management to prevent service disruptions and ensure rollback capabilities. This section covers change control processes, version management strategies, and approval workflows specific to Ansible deployments.

Change Classification

Standard Changes:

Normal Changes:

Emergency Changes:

Pre-Change Requirements

Documentation Requirements:

Technical Validation:

Version Control Strategy

Repository Structure:

ansible-infrastructure/
├── environments/
│   ├── production/
│   ├── staging/
│   └── development/
├── roles/
├── playbooks/
├── inventories/
└── CHANGELOG.md

Branching Strategy:

Tagging Convention:

Change Approval Workflow

Tier 1 Responsibilities:

Requires Escalation to Tier 2:

Tier 2/3 Responsibilities:

Change Implementation Process

Objective: Execute approved Ansible changes while maintaining system stability and enabling rapid rollback if needed.

Prerequisites:

Implementation Steps:

  1. Verify change approval status and maintenance window
  2. Check out approved playbook version using git tag
  3. Validate inventory targets match change scope
  4. Execute pre-change validation playbook if available
  5. Run main playbook with appropriate limit and verbosity
  6. Monitor execution progress and capture output logs
  7. Execute post-change validation procedures
  8. Update change record with completion status

Expected Result: Successful playbook execution with all tasks completed and validation checks passed.
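Step 5 calls for an appropriate limit and verbosity. One way to make that consistent is to assemble the command from explicit parameters. The helper below is a hypothetical sketch that only builds the argument list; it uses the standard `ansible-playbook` options `-i`, `--limit`, `-v`, `--check`, and `--diff`, and it refuses to produce a command without a limit so that the change scope is always stated.

```python
def build_playbook_command(playbook, inventory, limit, verbosity=1, check=False):
    """Return the ansible-playbook argument list for a scoped, logged run."""
    if not limit:
        raise ValueError("refusing to run without --limit; confirm the change scope")
    command = ["ansible-playbook", "-i", inventory, playbook, "--limit", limit]
    if verbosity:
        command.append("-" + "v" * verbosity)  # -v, -vv, -vvv ...
    if check:
        command.extend(["--check", "--diff"])  # dry run that shows pending changes
    return command


# Example values; the inventory path and group name are placeholders.
command = build_playbook_command("deploy.yml", "inventories/production", "web_servers")
print(" ".join(command))
```

Passing `check=True` first gives a dry run against the same targets, which can serve as a pre-change validation where the playbook's modules support check mode.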

Validation Steps:

Rollback Procedures

Rollback Triggers:

Rollback Methods:

Rollback Decision Authority:

Change Documentation

Required Documentation:

Post-Implementation Review:

Training Scenario: Emergency Change Management

Scenario: A critical security vulnerability requires immediate patching across 200 web servers. The security team has provided an Ansible playbook, but it hasn't been tested in your environment.

What would you do as Tier 1 support?

  1. Execute the playbook immediately on all servers
  2. Test the playbook on one server first
  3. Escalate to Tier 2 for emergency change approval
  4. Wait for normal change approval process

Correct Answer: Option 3 - Escalate to Tier 2 for emergency change approval.

Reasoning: Emergency changes still require proper authorization and risk assessment. Tier 1 should not execute untested playbooks on production systems, even during emergencies. Tier 2 can expedite the approval process while ensuring proper safeguards.

Common Mistakes:

Escalation Paths & RACI

RACI Matrix for Ansible Operations

The RACI (Responsible, Accountable, Consulted, Informed) matrix defines clear ownership and communication paths for Ansible-related activities across support tiers and organizational roles.

Playbook Development & Maintenance

Production Playbook Execution

Ansible Infrastructure Management

Escalation Triggers by Tier

Tier 1 Escalation Criteria

Tier 1 must escalate immediately when encountering:

Tier 2 Escalation Criteria

Tier 2 must escalate to Tier 3 when:

Tier 3 Escalation Criteria

Tier 3 must escalate to management/vendor when:

Escalation Workflows

Standard Escalation Process

  1. Document current state and attempted resolution steps
  2. Capture relevant log excerpts and error messages
  3. Identify affected systems and business impact
  4. Create escalation ticket with priority classification
  5. Notify receiving tier via established communication channels
  6. Provide verbal handoff within 15 minutes for P1/P2 issues
  7. Remain available for knowledge transfer and updates
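The information gathered in steps 1 through 4, together with the 15-minute handoff rule for P1/P2 issues, can be modeled as a small record that reports what is still missing. The class and field names below are hypothetical and do not correspond to any particular ticketing system.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class Escalation:
    summary: str
    priority: str            # "P1" through "P4"
    affected_systems: list
    attempted_steps: list
    log_excerpts: list = field(default_factory=list)
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def missing_fields(self):
        """Names of required items that are still empty."""
        required = {
            "summary": self.summary,
            "affected_systems": self.affected_systems,
            "attempted_steps": self.attempted_steps,
            "log_excerpts": self.log_excerpts,
        }
        return [name for name, value in required.items() if not value]

    def verbal_handoff_due(self):
        """Deadline for the verbal handoff, or None for priorities below P2."""
        if self.priority in ("P1", "P2"):
            return self.created_at + timedelta(minutes=15)
        return None


created = datetime(2024, 6, 15, 9, 0, tzinfo=timezone.utc)
escalation = Escalation(
    summary="deploy.yml fails on web tier",
    priority="P1",
    affected_systems=["web01", "web02"],
    attempted_steps=["re-ran playbook", "checked SSH connectivity"],
    created_at=created,
)
print(escalation.missing_fields(), escalation.verbal_handoff_due())
```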

Emergency Escalation Process

  1. Immediately contact on-call Tier 3 engineer via phone
  2. Send emergency notification to management chain
  3. Document incident in real-time collaboration tool
  4. Activate incident response bridge if multiple systems affected
  5. Engage vendor support for platform-level issues
  6. Notify business stakeholders of service impact

Communication Protocols

Escalation Communication Requirements

All escalations must include:

Update Frequency Requirements

Decision Authority Matrix

Tier 1 Authority

Tier 2 Authority

Tier 3 Authority

Vendor Escalation Procedures

Red Hat Support Engagement

  1. Verify support entitlement and contract details
  2. Gather sosreport and relevant system information
  3. Open case via Red Hat Customer Portal
  4. Provide detailed problem description and logs
  5. Assign appropriate severity level
  6. Schedule callback if immediate assistance needed

Third-Party Integration Support

  1. Identify affected integration or module
  2. Check community forums and documentation
  3. Engage vendor through appropriate support channel
  4. Provide integration-specific logs and configuration
  5. Coordinate between multiple vendors if necessary

Post-Escalation Procedures

Knowledge Transfer Requirements

Continuous Improvement Process

Known Issues & Limitations

Performance Limitations

Ansible has inherent performance constraints that impact large-scale deployments:

Tier 1 Action: Monitor playbook execution times and escalate if runs exceed expected baselines by 50% or more.
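
The 50% rule above can be checked mechanically. The following is a minimal sketch, assuming a hypothetical baseline of 120 seconds and a placeholder task standing in for the real workload. It records the start and end time on the control node and fails when the run takes 1.5 times the baseline or longer. The variable names are illustrative, not part of any standard.

```yaml
---
- name: Flag a run that exceeds its baseline by 50% or more
  hosts: localhost
  gather_facts: false
  vars:
    baseline_seconds: 120   # hypothetical baseline; use the documented value for the job
  tasks:
    - name: Record the start time in epoch seconds
      ansible.builtin.set_fact:
        run_started: "{{ lookup('ansible.builtin.pipe', 'date +%s') | int }}"

    - name: Placeholder for the real workload
      ansible.builtin.command: sleep 1
      changed_when: false

    - name: Compute the elapsed time
      ansible.builtin.set_fact:
        elapsed_seconds: "{{ (lookup('ansible.builtin.pipe', 'date +%s') | int) - (run_started | int) }}"

    - name: Fail when the run reaches 150% of the baseline
      ansible.builtin.assert:
        that:
          - (elapsed_seconds | int) < (baseline_seconds * 1.5)
        fail_msg: "Run took {{ elapsed_seconds }}s against a baseline of {{ baseline_seconds }}s. Escalate."
        success_msg: "Run took {{ elapsed_seconds }}s, within the escalation threshold."
```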

Windows Management Constraints

Windows automation has specific limitations compared to Linux management:

Network Device Automation Issues

Network automation presents unique challenges:

Scalability Boundaries

Ansible reaches practical limits at certain scales:

Security Model Limitations

Ansible's security model has inherent constraints:

Tier 1 Escalation: Immediately escalate any suspected credential exposure or unauthorized privilege usage.

Module-Specific Known Issues

Common problematic modules and their issues:

Error Handling Deficiencies

Ansible's error handling has several gaps:

Common Workarounds

Established patterns to mitigate known limitations:

Version-Specific Issues

Known problems in current Ansible versions:

Tier 1 Validation: Check Ansible version compatibility before troubleshooting module failures.
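
One way to perform this check is a short pre-flight play. The sketch below assumes a hypothetical minimum version of 2.14; substitute the version your playbooks and collections actually require. It relies on the built-in ansible_version variable and the version test.

```yaml
---
- name: Pre-check the Ansible version on the control node
  hosts: localhost
  gather_facts: false
  vars:
    minimum_ansible_version: "2.14"   # hypothetical requirement
  tasks:
    - name: Show the running version
      ansible.builtin.debug:
        msg: "Running Ansible {{ ansible_version.full }}"

    - name: Stop early if the version is older than required
      ansible.builtin.assert:
        that:
          - ansible_version.full is version(minimum_ansible_version, '>=')
        fail_msg: "Ansible {{ ansible_version.full }} is older than {{ minimum_ansible_version }}."
```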

Escalation Criteria

Escalate to Tier 2 when encountering:

Do Not Touch / Restricted Actions

Critical System Protection

Certain Ansible operations pose significant risk to production systems and require strict access controls. Understanding these restrictions prevents accidental damage and ensures proper escalation procedures.

Tier 1 Restrictions

Tier 1 support staff must NEVER perform the following actions:

High-Risk Operations Requiring Escalation

The following operations always require Tier 2 or higher approval:

Protected Infrastructure Components

These systems require special authorization before any Ansible operations:

Dangerous Ansible Modules

These modules and operation types require senior engineer approval:

shell
command (when not using creates/removes parameters)
raw
script
mount/unmount operations
user management with sudo privileges
cron job modifications
systemd service management for critical services
iptables or firewall modifications
package removal operations
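
The creates exception noted for the command module can be illustrated as follows. This is a sketch only: the host group, script path, and marker file are hypothetical, and the script is assumed to create the marker file itself when it succeeds.

```yaml
---
- name: Lower-risk alternatives to free-form command execution
  hosts: app_servers                        # hypothetical inventory group
  become: true
  tasks:
    - name: Run the initializer only if its marker file does not exist yet
      ansible.builtin.command:
        cmd: /opt/app/bin/init-db            # hypothetical script
        creates: /var/lib/app/.initialized   # task is skipped once this file exists

    - name: Use a purpose-built module instead of shelling out to systemctl
      ansible.builtin.service:
        name: nginx                          # hypothetical service
        state: started
```

With creates set, repeated runs do not re-execute the command, which removes much of the risk that makes the bare command module a restricted item.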

Emergency Override Procedures

In critical situations requiring immediate action:

  1. Contact on-call Tier 2 engineer immediately
  2. Document the emergency situation and business impact
  3. Obtain verbal approval with incident ticket number
  4. Execute only the minimum necessary actions
  5. Document all commands executed
  6. Schedule post-incident review within 24 hours

Access Control Validation

Before any Ansible operation, verify:

Escalation Triggers

Immediately escalate when:

Training Scenario

A customer requests immediate deployment of a security patch to all web servers using Ansible. The patch requires restarting the web service. What would you do?

Correct Response: Escalate to Tier 2. This involves production systems, service restarts, and security implications requiring senior approval and proper change management procedures.

Common Mistake: Running the playbook in development first to "test it." Even testing security patches requires proper authorization and may expose sensitive information.

Decommissioning / End-of-Life Procedures

Decommissioning Objectives

Properly decommissioning Ansible components ensures security, compliance, and resource optimization while maintaining operational continuity for remaining systems.

Pre-Decommissioning Assessment

Dependency Analysis

Tier 1 Actions:

Tier 2/3 Escalation Required:

Data Inventory Checklist

Managed Host Decommissioning

Individual Host Removal

Objective: Safely remove a managed host from Ansible control

Prerequisites: Confirmation that host is no longer needed, backup verification complete

Steps:

  1. Remove host from all inventory files
  2. Update any host-specific playbooks or group assignments
  3. Remove host-specific variables from group_vars or host_vars
  4. Clean up any host-specific vault entries
  5. Remove the Ansible control node's public SSH key from the target host's authorized_keys
  6. Update documentation and runbooks

Validation: Verify host no longer appears in ansible-inventory output and cannot be reached by test playbooks
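
The inventory half of this validation can be sketched as a small play run against the same inventory. The host name below is hypothetical. The play checks the built-in groups variable, which reflects what ansible-inventory would list.

```yaml
---
- name: Confirm a decommissioned host is gone from the inventory
  hosts: localhost
  gather_facts: false
  vars:
    retired_host: web01.example.com   # hypothetical host name
  tasks:
    - name: Fail if the host is still defined anywhere in the inventory
      ansible.builtin.assert:
        that:
          - retired_host not in groups['all']
        fail_msg: "{{ retired_host }} is still present in the inventory."
        success_msg: "{{ retired_host }} no longer appears in the inventory."
```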

Bulk Host Decommissioning

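# Decommission playbook: run against the hosts in the decommission_group inventory group.
# Assumes the play connects as an account other than "ansible"; an account
# cannot be removed cleanly while its own SSH session is still running.
# services_to_stop must be defined as a list (for example in group_vars).
- name: Decommission hosts
  hosts: decommission_group
  become: true
  tasks:
    - name: Stop managed services
      ansible.builtin.service:
        name: "{{ item }}"
        state: stopped
      loop: "{{ services_to_stop }}"

    # Revoke key-based access first so it is gone even if user removal fails.
    - name: Clear authorized keys
      ansible.builtin.file:
        path: /home/ansible/.ssh/authorized_keys
        state: absent

    # remove: true also deletes the account's home directory.
    - name: Remove automation user
      ansible.builtin.user:
        name: ansible
        state: absent
        remove: true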

Ansible Controller Decommissioning

Data Backup and Migration

Tier 2/3 Responsibility:

Service Shutdown Procedure

Prerequisites: All critical workloads migrated, stakeholder approval obtained

Steps:

  1. Disable all scheduled jobs and workflows
  2. Stop accepting new job submissions
  3. Allow running jobs to complete or safely terminate
  4. Stop Ansible services (for example ansible-tower, postgresql, redis; exact service names depend on the installed platform version)
  5. Disable system startup scripts
  6. Remove from load balancers or DNS records
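
Steps 4 and 5 can be sketched as a single play run from a separate control node. The host group name is hypothetical, and the service list simply repeats the names given in step 4; confirm them against the installed version before use.

```yaml
---
- name: Stop controller services and disable them at boot
  hosts: ansible_controller        # hypothetical inventory group
  become: true
  vars:
    controller_services:           # taken from step 4; verify on the target system
      - ansible-tower
      - postgresql
      - redis
  tasks:
    - name: Stop each service and prevent it from starting at boot
      ansible.builtin.service:
        name: "{{ item }}"
        state: stopped
        enabled: false
      loop: "{{ controller_services }}"
```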

Data Sanitization

Security Requirements:

License and Asset Management

License Reclamation

Tier 1 Actions:

Tier 2 Escalation: License reallocation and contract modifications

Hardware/VM Disposal

Documentation and Knowledge Transfer

Final Documentation Requirements

Knowledge Preservation

Archive critical operational knowledge:

Common Decommissioning Scenarios

Scenario: Emergency Decommissioning

Situation: Security incident requires immediate Ansible controller shutdown

What would you do?

Escalation Trigger: Any security-related decommissioning requires immediate Tier 2/3 involvement

Scenario: Planned Migration

Situation: Migrating from older Ansible version to new platform

Tier 1 Actions:

Post-Decommissioning Validation

Verification Checklist

Common Mistakes to Avoid

FAQ

General Questions

Q: What is the difference between Ansible and other automation tools like Puppet or Chef?

A: Ansible is agentless and communicates over SSH (or WinRM for Windows), which makes it simpler to deploy. It uses YAML playbooks rather than a custom language, and it pushes changes from the control node, whereas Puppet and Chef typically rely on agents that pull configuration from a central server.

Q: Do I need to install anything on target servers?

A: No agent is required. Ansible needs only SSH access and a Python interpreter on Linux/Unix target systems, and most Linux distributions include Python by default.

Q: Can Ansible manage Windows servers?

A: Yes. Ansible uses WinRM (Windows Remote Management) instead of SSH for Windows targets and includes Windows-specific modules.

Playbook and Task Questions

Q: Why did my playbook fail with "unreachable" errors?

A: Common causes include SSH connectivity issues, incorrect inventory hostnames/IPs, authentication failures, or target systems being offline. Check network connectivity and SSH key authentication first.

Q: How do I run only specific tasks in a playbook?

A: Use tags. Add tags to tasks and run with --tags tagname or skip tasks with --skip-tags tagname.
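
A minimal sketch of tagged tasks follows. The host group, template file, and playbook name (site.yml) are hypothetical.

```yaml
---
- name: Configure web servers
  hosts: webservers                 # hypothetical inventory group
  become: true
  tasks:
    - name: Install nginx
      ansible.builtin.package:
        name: nginx
        state: present
      tags:
        - install

    - name: Deploy nginx configuration
      ansible.builtin.template:
        src: nginx.conf.j2          # hypothetical template
        dest: /etc/nginx/nginx.conf
      tags:
        - config

# Run only the configuration task:
#   ansible-playbook site.yml --tags config
# Run everything except the install task:
#   ansible-playbook site.yml --skip-tags install
```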

Q: What does "changed=0" mean in the play recap?

A: The play ran successfully on that host but no task made a change, because the system was already in the desired state (idempotency).
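
As an illustration of idempotency, the sketch below uses a hypothetical directory path. On the first run the task reports changed if the directory is missing; on later runs it reports ok and contributes nothing to the changed count.

```yaml
---
- name: Idempotent directory management
  hosts: all
  become: true
  tasks:
    - name: Ensure the application log directory exists
      ansible.builtin.file:
        path: /var/log/myapp        # hypothetical path
        state: directory
        mode: "0755"
```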

Q: Can I run Ansible playbooks in parallel?

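A: Yes. Ansible already runs each task on several hosts in parallel, five at a time by default. Raise that limit with the --forks parameter (or forks in ansible.cfg), or set serial in a play to control how many hosts are processed per batch.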
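
A sketch of a rolling restart using serial is shown here. The host group, service name, and playbook file name are hypothetical.

```yaml
---
- name: Rolling restart, two hosts per batch
  hosts: webservers                 # hypothetical inventory group
  become: true
  serial: 2                         # finish the play on 2 hosts before starting the next 2
  tasks:
    - name: Restart the web service
      ansible.builtin.service:
        name: nginx                 # hypothetical service
        state: restarted

# Parallelism within each batch is still capped by forks:
#   ansible-playbook rolling.yml --forks 20
```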

Inventory and Variables

Q: How do I organize hosts into groups?

A: Create groups in inventory files using bracket notation [groupname] and list hosts underneath. Hosts can belong to multiple groups.
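
Inventories can also be written in YAML, where groups are nested under children instead of using bracket notation. The sketch below uses hypothetical host names and shows hosts that belong to more than one group.

```yaml
all:
  children:
    webservers:
      hosts:
        web01.example.com:
        web02.example.com:
    dbservers:
      hosts:
        db01.example.com:
    production:                     # web01 and db01 are also members of this group
      hosts:
        web01.example.com:
        db01.example.com:
```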

Q: Where should I store sensitive data like passwords?

A: Use Ansible Vault to encrypt sensitive variables. Never store passwords in plain text in playbooks or inventory files.
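
One common pattern, sketched below, keeps secrets in a separate variables file that is encrypted with ansible-vault and loaded through vars_files. The file path, variable name, and destination path are hypothetical.

```yaml
---
- name: Configure the application with a vaulted secret
  hosts: app_servers                # hypothetical inventory group
  become: true
  vars_files:
    - vars/secrets.yml              # hypothetical; encrypt with: ansible-vault encrypt vars/secrets.yml
  tasks:
    - name: Write the database credentials file
      ansible.builtin.copy:
        content: "DB_PASSWORD={{ vault_db_password }}\n"   # vault_db_password is defined in secrets.yml
        dest: /etc/myapp/db.env     # hypothetical path
        owner: root
        group: root
        mode: "0600"
      no_log: true                  # keep the secret out of task output and logs

# Supply the vault password at run time:
#   ansible-playbook site.yml --ask-vault-pass
```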

Q: How do I pass variables to playbooks at runtime?

A: Use --extra-vars "key=value" or -e @filename.yml to load variables from files.

Troubleshooting Questions

Q: My playbook works sometimes but fails other times. Why?

A: This often indicates race conditions, network timeouts, or dependencies on external services. Add appropriate error handling, retries, and wait conditions.
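
Retries and wait conditions might look like the following sketch. The port, health URL, and timing values are hypothetical and should be tuned to the service in question.

```yaml
---
- name: Tolerate slow or flaky dependencies
  hosts: app_servers                # hypothetical inventory group
  tasks:
    - name: Wait until the application port accepts connections
      ansible.builtin.wait_for:
        port: 8080                  # hypothetical port, checked on the managed node
        timeout: 120

    - name: Poll the health endpoint until it returns 200
      ansible.builtin.uri:
        url: http://localhost:8080/health   # hypothetical endpoint
        status_code: 200
      register: health
      until: health.status == 200
      retries: 5                    # give up after 5 attempts
      delay: 10                     # seconds between attempts
```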

Q: How do I debug failed tasks?

A: Use -vvv for verbose output, add debugger: on_failed to tasks, or use the debug module to print variable values.
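
A short sketch combining the task debugger and the debug module is shown below. The command is only an example.

```yaml
---
- name: Debugging aids
  hosts: all
  tasks:
    - name: Check root filesystem usage
      ansible.builtin.command: df -h /
      register: disk_usage
      changed_when: false
      debugger: on_failed           # drop into the interactive debugger if this task fails

    - name: Print the captured output
      ansible.builtin.debug:
        var: disk_usage.stdout_lines
```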

Q: Tasks fail with permission errors. What should I check?

A: Verify the SSH user has necessary permissions, consider using become: yes for privilege escalation, and check file/directory ownership and permissions.
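
The sketch below applies privilege escalation only to the task that needs it. The remote user, host group, and package name are hypothetical.

```yaml
---
- name: Escalate privileges only where required
  hosts: webservers                 # hypothetical inventory group
  remote_user: deploy               # hypothetical unprivileged SSH account
  tasks:
    - name: Install a package (requires root)
      ansible.builtin.package:
        name: httpd                 # hypothetical package
        state: present
      become: true

    - name: Confirm which account the connection uses
      ansible.builtin.command: whoami
      changed_when: false

# If sudo requires a password, prompt for it at run time:
#   ansible-playbook site.yml --ask-become-pass
```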

Performance and Best Practices

Q: My playbooks run slowly. How can I improve performance?

A: Increase fork count, use pipelining=True in ansible.cfg, minimize fact gathering with gather_facts: no when not needed, and use async tasks for long-running operations.
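
Two of these techniques, skipping fact gathering and running a long task asynchronously, are sketched below. The script path and the timing values are hypothetical.

```yaml
---
- name: Faster play with an asynchronous long-running task
  hosts: app_servers                # hypothetical inventory group
  gather_facts: false               # skip fact collection when no facts are used
  tasks:
    - name: Start the long job without waiting for it
      ansible.builtin.command: /usr/local/bin/long_job   # hypothetical script
      async: 1800                   # allow up to 30 minutes
      poll: 0                       # return immediately
      register: long_job

    - name: Wait for the job to finish
      ansible.builtin.async_status:
        jid: "{{ long_job.ansible_job_id }}"
      register: job_result
      until: job_result.finished
      retries: 60
      delay: 30
```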

Q: Should I use roles or playbooks?

A: Use roles for reusable, modular automation (like installing Apache). Use playbooks to orchestrate multiple roles and define specific workflows.
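
A sketch of a playbook that orchestrates roles follows. The role names and host groups are hypothetical; each role is assumed to live under roles/<name>/ with its tasks in tasks/main.yml.

```yaml
---
- name: Web tier
  hosts: webservers                 # hypothetical inventory group
  become: true
  roles:
    - common                        # hypothetical baseline role
    - apache

- name: Database tier
  hosts: dbservers                  # hypothetical inventory group
  become: true
  roles:
    - common
    - postgresql
```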

Q: How often should I run playbooks?

A: Depends on requirements. Configuration management playbooks can run frequently due to idempotency. Application deployment playbooks typically run on-demand or via CI/CD triggers.

Security Questions

Q: Is it safe to store SSH keys for Ansible?

A: Yes, with safeguards. Use dedicated service accounts with minimal required permissions. Consider loading keys into an SSH agent or using vault-managed credentials rather than leaving unencrypted private keys on disk.

Q: How do I rotate passwords managed by Ansible?

A: Update encrypted variables in Ansible Vault, then run playbooks to apply changes. Coordinate with applications that use those credentials.

Escalation Scenarios

When to escalate to Tier 2/3:

What Tier 1 can handle:

Glossary of Terms

Core Ansible Concepts

Ad-hoc Command: A single Ansible command executed directly from the command line without using a playbook, typically for quick tasks or testing.

Ansible Control Node: The machine where Ansible is installed and from which playbooks, ad-hoc commands, and other Ansible operations are executed.

Ansible Galaxy: A community hub for sharing and downloading Ansible roles, collections, and other content created by the Ansible community.

Ansible Vault: A feature that allows encryption of sensitive data such as passwords, keys, and other secrets within Ansible files.

Collection: A distribution format for Ansible content that includes modules, plugins, roles, and playbooks packaged together with metadata.

Facts: System information automatically gathered by Ansible about managed nodes, including hardware details, network configuration, and operating system information.

Handler: A special type of task that runs only when notified by other tasks, typically used for service restarts or configuration reloads.

Idempotency: The property that allows Ansible tasks to be run multiple times without changing the result beyond the initial application.

Inventory: A list of managed nodes (hosts) that Ansible can connect to and manage, along with variables and grouping information.

Managed Node: A remote system or host that is managed by Ansible, also referred to as a target host.

Module: A reusable, standalone script that performs a specific task on managed nodes, such as installing packages or managing files.

Play: An ordered list of tasks executed against a specific set of hosts defined in the inventory.

Playbook: A YAML file containing one or more plays that define the automation workflow and tasks to be executed.

Role: A way of organizing playbooks and other files in a standardized file structure for reusability and sharing.

Task: A single unit of work in Ansible that calls a module with specific parameters to perform an action on managed nodes.
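
Several of these terms can be seen together in one short playbook. The sketch below uses hypothetical host, file, and service names; the comments map each part to the glossary entry it illustrates.

```yaml
---
# Playbook: this file. It contains a single play.
- name: Configure web servers       # Play: tasks run against a set of hosts
  hosts: webservers                 # Managed nodes selected from the inventory
  become: true
  tasks:
    - name: Deploy the web server configuration   # Task
      ansible.builtin.template:     # Module called by the task
        src: httpd.conf.j2          # hypothetical template
        dest: /etc/httpd/conf/httpd.conf
      notify: Restart httpd         # triggers the handler only if the file changed

  handlers:
    - name: Restart httpd           # Handler: runs once, at the end of the play, if notified
      ansible.builtin.service:
        name: httpd
        state: restarted
```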

Execution and Control

Become: Ansible's privilege escalation system that allows tasks to run with elevated permissions (sudo, su, etc.).

Connection Plugin: Components that handle communication between the control node and managed nodes using protocols like SSH, WinRM, or local connections.

Delegation: The ability to run a task on a different host than the one currently being processed in the play.

Forks: The maximum number of parallel processes Ansible uses to communicate with managed nodes simultaneously (default: 5).

Gather Facts: The automatic collection of system information from managed nodes at the beginning of play execution.

Serial: A playbook directive that controls how many hosts in a group are processed at the same time during play execution.

Strategy: The method Ansible uses to execute tasks across multiple hosts, such as linear (default) or free strategy.

Variables and Templates

Group Variables: Variables that apply to all hosts within a specific inventory group, typically defined in group_vars directories.

Host Variables: Variables that apply to individual hosts, typically defined in host_vars directories or directly in inventory files.

Jinja2: The templating engine used by Ansible for variable substitution and conditional logic in templates and playbooks.

Magic Variables: Special variables automatically provided by Ansible that contain information about the current execution context.

Register: A task parameter that captures the output of a task execution and stores it in a variable for later use.

Template: A file that contains variables and expressions that get processed by the Jinja2 templating engine to generate final configuration files.

Configuration and Files

ansible.cfg: The main configuration file that controls Ansible's behavior, including default settings and operational parameters.

Dynamic Inventory: Inventory information generated automatically from external sources like cloud providers or CMDBs rather than static files.

Inventory Plugin: Components that enable Ansible to pull inventory information from various sources and formats.

Static Inventory: Inventory information defined in static files, typically in INI or YAML format.

Advanced Features

Callback Plugin: Components that respond to events during playbook execution, enabling custom logging, notifications, or integrations.

Conditional: Logic that determines whether a task should be executed based on variables, facts, or other conditions using 'when' statements.

Loop: A construct that allows a task to be executed multiple times with different values. The loop keyword is the recommended replacement for the older 'with_items' syntax, which remains supported.

Lookup Plugin: Components that allow Ansible to access data from external sources during playbook execution.

Tag: Labels assigned to tasks, plays, or roles that allow selective execution of specific parts of a playbook.
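
A conditional, a loop, and a tag can be combined on a single task, as in this sketch. The package names are only examples, and the condition relies on facts gathered at the start of the play.

```yaml
---
- name: Conditional, loop, and tag on one task
  hosts: all
  become: true
  tasks:
    - name: Install baseline packages on Red Hat family hosts only
      ansible.builtin.package:
        name: "{{ item }}"
        state: present
      loop:
        - git
        - curl
      when: ansible_facts['os_family'] == 'RedHat'
      tags:
        - packages
```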

Error Handling and Control Flow

Block: A way to group tasks together for error handling, allowing rescue and always sections for exception management.

Failed When: A task parameter that defines custom conditions for when a task should be considered failed.

Ignore Errors: A task parameter that allows playbook execution to continue even if the task fails.

Rescue: A section within a block that executes when tasks in the block fail, similar to a catch block in programming.
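
The four error-handling terms above fit together as in the following sketch. All script paths and the host group are hypothetical.

```yaml
---
- name: Error handling with block, rescue, and always
  hosts: app_servers                # hypothetical inventory group
  become: true
  tasks:
    - name: Deploy with automatic rollback
      block:
        - name: Run the database migration
          ansible.builtin.command: /opt/app/bin/migrate      # hypothetical script
          register: migrate_result
          # Custom failure rule: non-zero exit code or ERROR in the output
          failed_when: migrate_result.rc != 0 or 'ERROR' in migrate_result.stdout
      rescue:
        - name: Roll back after a failed migration
          ansible.builtin.command: /opt/app/bin/rollback     # hypothetical script
      always:
        - name: Report that the deployment attempt finished
          ansible.builtin.debug:
            msg: "Deployment attempt finished on {{ inventory_hostname }}"

    - name: Optional cache cleanup that must not stop the play
      ansible.builtin.command: /opt/app/bin/cleanup-cache    # hypothetical script
      ignore_errors: true
```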
