Ansible is an open-source automation platform that enables infrastructure as code, configuration management, application deployment, and orchestration across diverse IT environments. As an agentless automation tool, Ansible uses SSH for Linux/Unix systems and WinRM for Windows systems to execute tasks remotely without requiring software installation on managed nodes.
Ansible addresses four primary automation domains within enterprise IT operations:
Organizations implement Ansible to achieve measurable improvements in operational efficiency and reliability:
Ansible operates through a control node that executes automation against managed nodes using declarative language constructs. The architecture consists of:
Tier 1 Support: Execute pre-approved playbooks, monitor automation job status, and escalate failures following documented procedures.
Tier 2 Support: Troubleshoot playbook failures, modify existing automation, create simple playbooks, and manage inventory updates.
Tier 3 Support: Design complex automation workflows, develop custom modules, implement security policies, and architect enterprise Ansible deployments.
This training enables technical staff to effectively operate, troubleshoot, and extend Ansible automation within enterprise environments. Upon completion, participants will demonstrate competency in playbook execution, basic troubleshooting, and escalation procedures appropriate to their support tier.
This training is designed for IT operations professionals who need to understand, deploy, or troubleshoot Ansible automation in enterprise environments. The content assumes basic Linux command-line proficiency and fundamental networking concepts.
Participants should have:
This training covers operational deployment and management of Ansible in production environments. The focus is on practical implementation rather than development of complex automation logic.
Tier 1 engineers will learn to:
Senior operations staff will learn to:
Upon completion, participants will be able to:
Hands-on exercises require:
Tier 1 support handles initial incident response and basic operational tasks that require minimal Ansible expertise.
Tier 2 support handles complex troubleshooting, playbook analysis, and configuration modifications requiring intermediate Ansible knowledge.
Tier 3 support handles expert-level issues, architecture decisions, and strategic automation development requiring deep Ansible expertise.
When escalating between tiers, always include:
This learning path provides a structured progression through Ansible concepts and skills, designed for technical professionals moving from basic automation tasks to advanced enterprise implementations.
Objective: Establish core understanding of Ansible architecture and terminology
Prerequisites: Basic Linux command line knowledge, SSH familiarity
Duration: 8-12 hours
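Validation Exercise: Create a simple inventory file with 3 test servers and run the ping module against all hosts (ansible all -i <inventory_file> -m ping). Note that ansible --version only reports the version installed on the control node and does not contact any hosts.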
Objective: Execute immediate tasks without playbooks
Prerequisites: Module 1 completion
Duration: 6-8 hours
Decision Prompt: You need to check disk space on 50 servers immediately. What would you do?
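Answer: Use an ad-hoc command: ansible all -m command -a "df -h". The command module is sufficient here because no shell features such as pipes or redirects are needed; add -f 10 or higher to raise the default of 5 parallel forks.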
Objective: Create structured, repeatable automation scripts
Prerequisites: Module 2 completion
Duration: 12-16 hours
Scenario Example: Create a playbook that installs Apache, starts the service, and deploys a custom index.html file only on web servers in the inventory.
Common Mistake: Forgetting to use become: yes for tasks requiring root privileges. Always validate privilege requirements before execution.
Objective: Implement complex logic and error handling
Prerequisites: Module 3 completion
Duration: 10-14 hours
Validation Exercise: Build a playbook with error handling that attempts to start a service, captures failure, and sends notification on error.
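One way to approach this exercise is with a block/rescue/always structure. The sketch below is a minimal illustration, not a reference solution: it assumes the service is httpd on the webservers group, and it uses a debug task as a stand-in for a real notification (mail, chat webhook, ticket), which the exercise leaves unspecified.

```yaml
---
- name: Start a service with failure handling
  hosts: webservers
  become: yes
  tasks:
    - name: Attempt to start the service and react to failure
      block:
        - name: Start httpd
          service:
            name: httpd
            state: started
      rescue:
        # ansible_failed_result holds the return data of the task that failed
        - name: Report the captured failure (stand-in for a real notification)
          debug:
            msg: "httpd failed to start on {{ inventory_hostname }}: {{ ansible_failed_result.msg | default('no message') }}"
      always:
        - name: Record that the attempt finished
          debug:
            msg: "Service start attempt completed on {{ inventory_hostname }}"
```

Because the rescue section handles the error, the host is not counted as failed in the recap; add a final fail task inside rescue if the run should still end in failure.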
Objective: Organize infrastructure and manage configuration data
Prerequisites: Module 4 completion
Duration: 8-10 hours
Decision Prompt: You have database passwords that need to be used in playbooks but kept secure. What approach would you use?
Answer: Use Ansible Vault to encrypt sensitive variables in separate files, referenced in playbooks.
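A common layout for this answer keeps a plain variable file that points at a vault-prefixed variable, so reviewers can see which secrets exist without seeing their values. The sketch below assumes a dbservers group, hypothetical variable names (db_user, db_password, vault_db_password), and the mysql_user module from the community.mysql collection; adjust all of these to the local environment.

```yaml
# group_vars/dbservers/vars.yml (plain text, safe to review in diffs)
---
db_user: app
db_password: "{{ vault_db_password }}"

# group_vars/dbservers/vault.yml holds vault_db_password and is created
# with: ansible-vault create group_vars/dbservers/vault.yml

# playbook that consumes the secret
---
- name: Create the application database user
  hosts: dbservers
  become: yes
  tasks:
    - name: Ensure the database user exists
      mysql_user:
        name: "{{ db_user }}"
        password: "{{ db_password }}"
        priv: "appdb.*:ALL"
        state: present
      no_log: true   # keeps the password out of task output and logs
```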
Objective: Structure reusable automation components
Prerequisites: Module 5 completion
Duration: 12-16 hours
Scenario Example: Convert an existing playbook into a reusable role that can be shared across multiple projects with different variable inputs.
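The key to reuse is moving hard-coded values into role defaults so each project can override them. A minimal sketch, assuming a hypothetical role named webserver, a hypothetical internal_web group, and illustrative variable names:

```yaml
# roles/webserver/defaults/main.yml: overridable defaults for the role
---
webserver_port: 80
webserver_docroot: /var/www/html

# site.yml: two plays reuse the same role with different inputs
---
- name: Public web servers (role defaults apply)
  hosts: webservers
  become: yes
  roles:
    - role: webserver

- name: Internal web servers on an alternate port
  hosts: internal_web
  become: yes
  roles:
    - role: webserver
      vars:
        webserver_port: 8080
```

Values in defaults/main.yml have the lowest precedence, which is what makes them safe to override per play, per group, or on the command line.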
Objective: Implement Ansible in production environments
Prerequisites: Module 6 completion
Duration: 14-18 hours
Tier 1 Support Track: Modules 1-3, focus on executing existing playbooks and basic troubleshooting
Tier 2 Administrator Track: Modules 1-6, emphasis on playbook development and role creation
Tier 3 Architect Track: All modules, including enterprise integration and advanced optimization techniques
Expected Completion Timeline: 8-12 weeks for full track completion with hands-on practice between modules.
Objective: Deploy and configure Apache web servers across multiple hosts using Ansible playbooks.
Prerequisites:
Scenario: Your organization needs to deploy Apache web servers on three CentOS hosts with custom index pages and firewall rules.
Step-by-step Instructions:
[webservers]
web1.example.com
web2.example.com
web3.example.com
---
- name: Configure web servers
  hosts: webservers
  become: yes
  tasks:
    - name: Install Apache
      yum:
        name: httpd
        state: present
    - name: Start and enable Apache
      systemd:
        name: httpd
        state: started
        enabled: yes
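The scenario also calls for a custom index page and firewall rules, which the playbook above does not cover. The two tasks below are a sketch of what could be appended under tasks:, assuming firewalld is the host firewall on the CentOS hosts (the firewalld module ships in the ansible.posix collection on newer releases) and using placeholder page content.

```yaml
# Append these entries to the tasks: list of the play above
- name: Deploy custom index page
  copy:
    content: "<h1>Served by {{ inventory_hostname }}</h1>\n"
    dest: /var/www/html/index.html
    owner: root
    group: root
    mode: "0644"

- name: Allow HTTP through firewalld
  firewalld:
    service: http
    permanent: yes
    immediate: yes
    state: enabled
```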
Expected Result: Apache running on all three hosts with custom content accessible via HTTP.
Validation Steps:
curl http://hostname
systemctl status httpd
What would you do? If one host fails during playbook execution, how would you troubleshoot and retry only the failed host?
Answer: Use --limit flag to target specific hosts and -vvv for detailed error output. Check SSH connectivity and sudo permissions first.
Objective: Deploy MySQL database servers with security hardening and user management.
Prerequisites:
Scenario: Deploy MySQL on database servers with encrypted root passwords, create application databases, and configure backup users.
Step-by-step Instructions:
ansible-vault create group_vars/dbservers/vault.yml
Expected Result: Secure MySQL installation with application databases and restricted user access.
Validation Steps:
Common Mistakes:
Objective: Create end-to-end application deployment using roles and handlers.
Scenario: Deploy a Python web application with Nginx reverse proxy, including SSL certificates and monitoring configuration.
Step-by-step Instructions:
roles/
├── common/
├── nginx/
├── python-app/
└── monitoring/
What would you do? During deployment, the application fails to start due to a configuration error. How would you rollback and investigate?
Answer: Use tags to run only rollback tasks, check application logs, and validate configuration syntax before redeployment. Implement health checks in playbook.
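The answer can be sketched as a play that validates configuration, checks health after deployment, and keeps rollback tasks behind a tag. Everything specific here is an assumption for illustration: the appservers group, the health URL on port 8000, and the release directory layout.

```yaml
---
- name: Deploy with validation, health check, and tagged rollback
  hosts: appservers
  become: yes
  tasks:
    - name: Validate Nginx configuration before reloading
      command: nginx -t
      changed_when: false
      tags: [deploy]

    - name: Wait for the application to answer its health check
      uri:
        url: "http://localhost:8000/health"
        status_code: 200
      register: health
      until: health.status == 200
      retries: 5
      delay: 10
      tags: [deploy]

    # The special "never" tag keeps this task out of normal runs
    - name: Point the current release back at the previous one
      file:
        src: /opt/app/releases/previous
        dest: /opt/app/current
        state: link
      tags: [never, rollback]
```

With this layout, ansible-playbook <playbook.yml> --tags rollback runs only the rollback task, while a normal run skips it.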
Objective: Dynamically scale infrastructure based on load requirements using dynamic inventory.
Scenario: Scale web tier by adding new instances and updating load balancer configuration automatically.
Step-by-step Instructions:
Expected Result: Additional capacity available with automated load balancer updates.
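The source does not name a cloud provider, so the following is only one possible shape for the dynamic inventory: a sketch assuming AWS EC2, the amazon.aws.aws_ec2 inventory plugin (older releases use the short name aws_ec2), and instances tagged Role=web. The region and tag names are hypothetical.

```yaml
# inventory/web.aws_ec2.yml (the file name must end in aws_ec2.yml or aws_ec2.yaml)
plugin: amazon.aws.aws_ec2
regions:
  - us-east-1
filters:
  instance-state-name: running
  "tag:Role": web
keyed_groups:
  # Builds groups such as role_web from each instance's Role tag
  - key: tags.Role
    prefix: role
```

A play with hosts: role_web then picks up newly launched instances on the next run without any manual inventory edit; ansible-inventory -i inventory/web.aws_ec2.yml --graph shows the resolved groups.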
Tier 1 Responsibilities:
Escalation Triggers:
Tier 2/3 Responsibilities:
Objective: Execute disaster recovery procedures using Ansible automation.
Scenario: Primary data center is unavailable. Restore services in secondary location using backup configurations and data.
Step-by-step Instructions:
Critical Validation Points:
What would you do? If database restoration fails due to corruption, what immediate actions should you take?
Answer: Immediately escalate to Tier 2, attempt restoration from previous backup point, document failure details, and activate manual procedures if available.
Situation: You execute an Ansible playbook against 20 servers, and it fails on 8 of them with various error messages including "SSH connection timeout," "Permission denied," and "Module not found."
What would you do?
Correct Answer:
- Test basic connectivity with ansible all -m ping
- Re-run the playbook with the -vvv flag for detailed error output
Reasoning: Multiple failure types suggest infrastructure or configuration issues rather than playbook logic problems. Systematic verification of connectivity and permissions addresses the most common failure causes.
Common Mistake: Immediately modifying the playbook code without first verifying basic connectivity and authentication.
Situation: Your Ansible playbook completes with "ok" status on all tasks, but when you check the target servers, the expected configuration changes are not present.
What would you do?
Correct Answer:
- Confirm the playbook was not run with check_mode or dry-run parameters
- Verify that when clauses aren't preventing execution
- Run with the --diff flag to see what changes would be made
Reasoning: An "ok" status typically means Ansible found the desired state already in place; tasks whose conditions evaluate to false are reported as skipped. This requires investigating why changes weren't applied rather than assuming failure.
Common Mistake: Assuming the playbook is broken when it may be working correctly but conditions prevent changes.
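To see whether a condition is suppressing a change, it can help to print the value the condition depends on and the registered result of the task. A minimal sketch, assuming a hypothetical flag manage_app_config and a hypothetical template app.conf.j2:

```yaml
---
- name: Diagnose a task that reports no changes
  hosts: webservers
  become: yes
  vars:
    manage_app_config: false   # hypothetical flag; false causes the deploy task to be skipped
  tasks:
    - name: Show the value the condition depends on
      debug:
        var: manage_app_config

    - name: Deploy application configuration
      template:
        src: app.conf.j2
        dest: /etc/app/app.conf
      when: manage_app_config | bool
      register: config_result

    - name: Show whether the task changed anything or was skipped
      debug:
        msg: "changed={{ config_result.changed }} skipped={{ config_result.skipped | default(false) }}"
```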
Situation: You run a playbook targeting the "webservers" group, but it executes against database servers instead, or some expected web servers are missing from the execution.
What would you do?
Tier 1 Actions:
- Run ansible-inventory --list to see how groups are resolved
- Confirm the intended inventory file is passed with the -i parameter
Escalate to Tier 2 if: Inventory structure requires reorganization or dynamic inventory sources need configuration.
Reasoning: Incorrect host targeting usually stems from inventory configuration issues that can be diagnosed through Ansible's built-in inventory tools.
Situation: An Ansible task starts executing but appears to hang indefinitely without completing or failing. The playbook shows the task as "running" for over 30 minutes.
What would you do?
Immediate Actions:
Escalation Trigger: If task involves custom modules or complex operations requiring code analysis.
Reasoning: Hanging tasks often indicate resource constraints, network issues, or missing timeout configurations rather than Ansible bugs.
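Missing timeouts are the easiest of these causes to address in the playbook itself. The sketch below shows two bounded tasks; the script path and port are hypothetical placeholders.

```yaml
---
- name: Bound the runtime of long tasks
  hosts: all
  become: yes
  tasks:
    # async sets the maximum runtime in seconds; poll sets the status check interval
    - name: Run a long job, failing if it exceeds 10 minutes
      command: /usr/local/bin/long_job.sh
      async: 600
      poll: 15

    - name: Wait for a port, but give up after 60 seconds
      wait_for:
        port: 8080
        timeout: 60
```

With limits like these, a stuck operation ends in a reported failure that can be escalated with its error output instead of leaving the run in an indefinite "running" state.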
Situation: Your playbook uses variables, but when executed, you see literal variable names (like "{{ app_version }}") in configuration files instead of the expected values.
What would you do?
Correct Answer:
Common Mistake: Assuming variables are undefined when they may be defined but not accessible due to scope or precedence issues.
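One frequent cause of literal placeholders in deployed files is that the file was transferred with the copy module, which does not render Jinja2, rather than the template module. The sketch below contrasts the two; the file paths and the app_version value are hypothetical, and other causes (scope, precedence) remain possible.

```yaml
---
- name: Render variables into a configuration file
  hosts: appservers
  become: yes
  vars:
    app_version: "1.4.2"
  tasks:
    # copy transfers the file byte for byte, so placeholders stay literal
    - name: Copy without rendering
      copy:
        src: files/app.conf
        dest: /etc/app/app.conf.unrendered

    # template passes the file through Jinja2 first, substituting variables
    - name: Render the template
      template:
        src: templates/app.conf.j2
        dest: /etc/app/app.conf
```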
Situation: After adding a new role to your playbook, existing roles begin failing with errors about conflicting handlers or duplicate task names.
Tier 1 Assessment:
Escalate to Tier 2 for: Role refactoring, dependency resolution, or architectural changes to eliminate conflicts.
Reasoning: Role conflicts typically require structural changes that go beyond basic troubleshooting and may impact multiple playbooks.
Establishing clear completion criteria ensures Ansible automation tasks are properly validated and meet operational standards before being considered complete.
Objective: Verify playbook has executed successfully without errors or unexpected failures.
Success Criteria:
Validation Steps:
Objective: Confirm target systems are in the desired configuration state.
Success Criteria:
Validation Commands:
# Service status verification (check mode reports "changed" if httpd is not running)
ansible all -m service -a "name=httpd state=started" --check
# File content verification
ansible all -m command -a "grep 'expected_value' /path/to/config"
# Port connectivity check
ansible all -m wait_for -a "port=80 timeout=10"
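The same checks can be captured in a small validation playbook so that a failed check fails the run instead of relying on someone reading the output. A sketch, assuming systemd hosts where the unit is named httpd.service:

```yaml
---
- name: Validate system state after a run
  hosts: webservers
  become: yes
  gather_facts: no
  tasks:
    - name: Collect service states
      service_facts:

    - name: Assert httpd is running
      assert:
        that:
          - "'httpd.service' in ansible_facts.services"
          - "ansible_facts.services['httpd.service'].state == 'running'"
        fail_msg: "httpd is not running"

    - name: Confirm port 80 accepts connections
      wait_for:
        port: 80
        timeout: 10
```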
Objective: Ensure playbook can be run multiple times without unintended changes.
Success Criteria:
Testing Process:
ansible-playbook playbook.yml --check
Objective: Ensure proper documentation and adherence to organizational standards.
Success Criteria:
Tier 1 Authority:
Requires Tier 2/3 Approval:
Escalate When:
Scenario: Your playbook executed with the following recap:
PLAY RECAP *****************************
web01: ok=5 changed=2 unreachable=0 failed=0
web02: ok=5 changed=0 unreachable=0 failed=0
db01: ok=3 changed=1 unreachable=0 failed=0
Decision Point: Can this be marked as complete?
Correct Assessment: Potentially complete, but requires validation. The different "changed" counts between web01 and web02 need investigation. Verify why web01 had changes while web02 did not - this could indicate configuration drift or a legitimate difference in initial state.
Common Mistake: Marking complete based solely on "failed=0" without investigating why identical systems show different change counts.
Ansible operates on a fundamentally different paradigm than traditional scripting or configuration management tools. Think of Ansible as a declarative language where you describe the desired end state rather than the specific steps to achieve it. This shift from "how to do something" to "what the final result should look like" is critical for understanding Ansible's power and limitations.
Visualize Ansible as having three primary layers:
The control layer never installs agents on target systems. Instead, it pushes temporary Python modules over SSH, executes them, and removes them. This "agentless" model means targets only need SSH access and Python - no persistent Ansible processes run on managed nodes.
Idempotency means running the same Ansible task multiple times produces the same result without unwanted side effects. A properly written Ansible task checks current state before making changes. If the system is already in the desired state, no action occurs. If changes are needed, Ansible applies only what's necessary to reach the target state.
Example mental model: Think of idempotency like a thermostat. You set it to 72°F. If the room is already 72°F, nothing happens. If it's 68°F, heat turns on until it reaches 72°F. Running the "set to 72°F" command repeatedly won't overheat the room.
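The thermostat idea maps directly onto task design. The sketch below contrasts a task that repeats its action on every run with one that checks state first; the file paths are throwaway placeholders chosen for illustration.

```yaml
---
- name: Contrast a non-idempotent task with an idempotent one
  hosts: all
  tasks:
    # Appends a duplicate line on every run and always reports "changed"
    - name: Non-idempotent append
      shell: echo "vm.swappiness = 10" >> /tmp/nonidempotent.conf

    # Checks for the line first; reports "ok" once the line exists
    - name: Idempotent equivalent
      lineinfile:
        path: /tmp/idempotent.conf
        line: "vm.swappiness = 10"
        create: yes
```

Running the play twice makes the difference visible: the first file gains a second copy of the line, while the second file is left untouched.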
The inventory is Ansible's map of your infrastructure. It defines not just which systems exist, but how they're grouped and what variables apply to each. Think of inventory as creating logical relationships between physical or virtual resources. A single server might belong to multiple groups simultaneously (webservers, production, east-coast) and inherit variables from each group.
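As an illustration of multi-group membership, the YAML inventory below places one host in three groups, each contributing a variable. The host name, group names, and variables are hypothetical.

```yaml
# inventory/hosts.yml
all:
  children:
    webservers:
      hosts:
        web1.example.com:
      vars:
        http_port: 80
    production:
      hosts:
        web1.example.com:
      vars:
        env_name: production
    east_coast:
      hosts:
        web1.example.com:
      vars:
        ntp_server: ntp-east.example.com
```

Here web1.example.com ends up with http_port, env_name, and ntp_server all defined; ansible-inventory -i inventory/hosts.yml --host web1.example.com displays the merged result.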
Understanding the hierarchy is essential:
Mental model: Think of a playbook as a recipe book, plays as individual recipes, and tasks as recipe steps. Each recipe (play) might serve different groups of people (host groups) but uses the same basic ingredients and techniques (modules).
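In file form, the hierarchy looks roughly like the sketch below: one playbook file, two plays aimed at different host groups, each with tasks that call a module. The group, package, and service names are illustrative.

```yaml
---
# Playbook: one file containing two plays
- name: Play 1, configure web servers
  hosts: webservers
  become: yes
  tasks:
    - name: Task that calls the package module
      package:
        name: httpd
        state: present

- name: Play 2, configure database servers
  hosts: dbservers
  become: yes
  tasks:
    - name: Task that calls the service module
      service:
        name: mysqld
        state: started
```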
Modules are Ansible's building blocks - discrete units of functionality that handle specific operations. Each module is designed to be idempotent and handle error conditions gracefully. Modules abstract the complexity of different operating systems, package managers, and service managers behind consistent interfaces.
Key insight: You don't call system commands directly in well-designed Ansible automation. Instead, you use modules that understand the underlying system differences and handle edge cases appropriately.
Traditional scripts execute commands in sequence. Ansible evaluates desired state and determines necessary actions. This distinction affects how you approach problem-solving:
Ansible's default behavior is to stop execution on a host when a task fails, but continue on other hosts. This "fail fast" approach prevents cascading errors while maintaining parallel execution benefits. Understanding this behavior is crucial for designing robust automation that handles partial failures gracefully.
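The default can be adjusted per task or per play. The following sketch shows two common controls, assuming a rolling change across a webservers group; the script path and percentages are placeholders.

```yaml
---
- name: Tune failure behaviour for a rolling change
  hosts: webservers
  become: yes
  serial: 5
  max_fail_percentage: 20   # abort the batch if more than 20% of its hosts fail
  tasks:
    - name: Optional cleanup whose failure should not stop the host
      command: /usr/local/bin/cleanup.sh
      ignore_errors: yes

    - name: Restart the service
      service:
        name: httpd
        state: restarted
```

For changes where any single failure should halt every host, any_errors_fatal: true at play level is the stricter alternative.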
Variables in Ansible follow a complex precedence hierarchy. Think of variables as having different "weights" - command-line variables override playbook variables, which override inventory variables, which override role defaults. Understanding this hierarchy prevents confusion when the same variable name appears in multiple locations with different values.
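A small experiment makes the "weights" concrete. The sketch assumes a hypothetical role named app and a hypothetical variable app_port defined at two levels.

```yaml
# roles/app/defaults/main.yml (lowest precedence)
---
app_port: 8000

# site.yml
---
- name: Demonstrate variable precedence
  hosts: appservers
  vars:
    app_port: 8080   # play vars override inventory variables and role defaults
  roles:
    - role: app
  tasks:
    - name: Show the value that won
      debug:
        var: app_port
```

Run as written, the debug task reports 8080; adding -e app_port=9090 on the command line overrides both definitions.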
Ansible's push model means the control node initiates all actions. This differs from pull-based systems where agents periodically check for updates. The push model provides immediate execution and centralized control but requires the control node to reach all targets. Network connectivity, authentication, and timing all flow from control node to targets, never the reverse.
The Ansible control node serves as the central management point where Ansible is installed and executed. This node contains the Ansible engine, inventory files, playbooks, and configuration files. The control node communicates with managed nodes via SSH (Linux/Unix) or WinRM (Windows) without requiring agent installation on target systems.
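Key control node requirements include a supported Python 3 interpreter (3.8 or later for ansible-core 2.12; later releases require newer versions) and SSH connectivity to managed nodes. Python 2.7 was accepted on control nodes only by older releases (through ansible-core 2.11). The control node can be a physical server, virtual machine, or containerized environment depending on organizational needs.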
Managed nodes are target systems that Ansible configures and manages. These nodes require minimal prerequisites: SSH service running, Python interpreter available, and network connectivity to the control node. Managed nodes do not require Ansible installation, making the architecture lightweight and scalable.
When Ansible executes tasks, it copies Python modules to managed nodes temporarily, executes them, and removes them upon completion. This agentless approach reduces maintenance overhead and security surface area.
The Ansible engine consists of several interconnected components:
Ansible's plugin system extends core functionality through modular components:
Ansible follows a push-based architecture where the control node initiates all communication:
Ansible implements security through existing infrastructure components rather than introducing new authentication mechanisms. SSH key-based authentication provides secure, passwordless access to managed nodes. All communication occurs over encrypted channels using standard protocols.
The agentless design eliminates persistent processes on managed nodes, reducing attack surface. Privilege escalation uses existing mechanisms like sudo, su, or runas, maintaining consistency with organizational security policies.
Ansible's architecture supports horizontal scaling through several mechanisms:
Ansible integrates with external systems through multiple touchpoints:
Tier 1 Responsibilities: Monitor control node status, verify SSH connectivity, check basic inventory accessibility, and validate playbook syntax using ansible-playbook --syntax-check.
Escalation Required: Control node configuration changes, plugin installation or modification, SSH key management, and inventory source modifications require Tier 2 involvement.
Tier 2/3 Responsibilities: Architecture design decisions, plugin development, security configuration, and integration with external systems.
Ansible supports multiple authentication mechanisms for connecting to managed nodes and accessing control systems. The primary methods include SSH key-based authentication, password authentication, and integration with external authentication systems.
SSH key-based authentication is the recommended approach for Linux/Unix systems. Ansible uses the control node's SSH client to establish connections, leveraging existing SSH configurations and key pairs. Password authentication serves as a fallback option but requires additional security considerations in production environments.
For Windows systems, Ansible utilizes WinRM (Windows Remote Management) with support for basic authentication, certificate-based authentication, and Kerberos integration for domain environments.
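For a domain environment, the connection settings are usually placed in group variables. The sketch below assumes a hypothetical windows inventory group, a hypothetical service account, WinRM over HTTPS, and a control node with Kerberos support installed for pywinrm.

```yaml
# group_vars/windows.yml
ansible_connection: winrm
ansible_port: 5986
ansible_winrm_scheme: https
ansible_winrm_transport: kerberos
ansible_user: "svc_ansible@EXAMPLE.COM"
```

The account password is deliberately absent here; it would come from a Kerberos ticket or a vaulted variable rather than a plain-text file.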
Ansible Tower and AWX provide comprehensive role-based access control (RBAC) systems that govern user permissions and resource access. The framework operates on three core components: users, teams, and roles.
Users represent individual accounts with specific credentials and permissions. Teams group users with similar responsibilities or organizational functions. Roles define permission sets that can be assigned to users or teams for specific resources.
Resource-level permissions control access to inventories, projects, job templates, credentials, and organizations. Permissions cascade through the organizational hierarchy, allowing administrators to implement granular access controls.
The system defines several built-in role types with predefined permission sets:
Ansible Tower stores and manages credentials securely using encryption at rest. Credential types include machine credentials for SSH access, cloud credentials for dynamic inventory, source control credentials for project synchronization, and vault credentials for encrypted variable files.
Machine credentials contain SSH private keys, usernames, passwords, and privilege escalation settings. These credentials can be associated with specific inventories or job templates to automate authentication during playbook execution.
Cloud credentials enable dynamic inventory synchronization and resource provisioning across various cloud platforms. Each cloud provider requires specific credential formats and permission scopes.
Objective: Configure SSH key-based authentication for Ansible managed nodes
Prerequisites: Administrative access to control node, target node credentials, SSH client tools
Steps:
1. Generate a key pair: ssh-keygen -t rsa -b 4096 -f ~/.ssh/ansible_key
2. Copy the public key to the target: ssh-copy-id -i ~/.ssh/ansible_key.pub user@target_host
3. Test the key-based login: ssh -i ~/.ssh/ansible_key user@target_host
4. Set the key in ansible.cfg: private_key_file = ~/.ssh/ansible_key
5. Test through Ansible: ansible target_host -m ping
Expected Result: Successful SSH connection without password prompts and successful Ansible ping response
Validation: Execute ansible-inventory --list and ansible all -m setup --limit target_host to confirm authentication and fact gathering
Objective: Assign appropriate roles to users for specific resources in Ansible Tower
Prerequisites: System Administrator or Organization Admin permissions, existing user accounts, defined resources
Steps:
Expected Result: User can access assigned resources with specified permission level
Validation: Log in as target user and verify access to assigned resources matches role permissions
Scenario: A new team member needs access to execute existing playbooks for web server maintenance but should not modify configurations.
What would you do? Assign Execute role for specific job templates related to web server maintenance, avoiding Admin or Modify permissions.
Reasoning: Execute permissions allow job template execution while preventing unauthorized modifications to critical automation workflows.
Scenario: SSH authentication fails with "Permission denied (publickey)" error when running playbooks.
What would you do? Verify SSH key permissions (600 for private key), confirm public key installation on target hosts, and check SSH agent configuration.
Reasoning: SSH key authentication requires proper file permissions and key distribution to function correctly.
Tier 1 Responsibilities:
Escalation Required:
Tier 2/3 Responsibilities:
Avoid using shared credentials across multiple users or systems. Each user should have individual authentication credentials for proper audit trails and access control.
Do not store passwords in plain text within playbooks or inventory files. Use Ansible Vault for sensitive data encryption or leverage Tower's credential management system.
Prevent over-privileged access by assigning minimal required permissions. Regular access reviews help identify and remediate excessive permissions over time.
Ensure SSH key rotation follows organizational security policies. Stale or compromised keys create security vulnerabilities in automation systems.
Execute Ansible automation tasks following a systematic approach that ensures reliability, traceability, and proper escalation when issues arise.
Decision Point: Is this a pre-approved, standard playbook execution?
ansible-playbook -i inventory playbook.yml --check --diff
Decision Point: Does the dry run output match expected changes?
ansible-playbook -i inventory playbook.yml
Decision Point: Did the playbook complete successfully without failures?
Tier 1 Responsibility: Complete validation steps and documentation. Workflow complete.
Decision Point: Is this a known, recoverable error with documented resolution?
What would you do? 5 out of 20 target hosts failed during playbook execution.
Correct Action: Document which hosts failed and the specific errors, then escalate to Tier 2. Do not retry without understanding the failure cause.
Reasoning: Partial failures may indicate environmental issues, permission problems, or host-specific configurations that require investigation.
What would you do? Playbook fails immediately with SSH connection errors to all hosts.
Correct Action: Verify network connectivity and SSH access manually to a sample host. If connectivity is confirmed down, escalate as a network issue. If access works manually, escalate as an Ansible configuration issue.
Reasoning: Distinguishing between network and configuration issues helps route the escalation appropriately.
What would you do? Check mode shows the playbook will modify 50 additional files not mentioned in the change request.
Correct Action: Stop the workflow and escalate to Tier 2 with the dry run output. Do not proceed with execution.
Reasoning: Scope creep in automation can have unintended consequences and requires review.
Applies to version(s): Ansible 2.9 through 6.x (ansible-core 2.12-2.13 for the 5.x and 6.x community packages)
What this does: Sets up the primary Ansible control node where playbooks are executed and managed hosts are orchestrated from.
Prerequisites: Root or sudo access on a Linux system, Python 3.8 or higher installed, network connectivity to target managed hosts.
What to avoid: Do not install Ansible directly on production servers that will be managed by Ansible, as this creates circular dependency issues. Avoid using Python 2.x as it is deprecated and unsupported.
GUI method:
CLI method (Bash):
1. Update the system packages: sudo apt update (Ubuntu/Debian) or sudo yum update (RHEL/CentOS)
2. Install pip: sudo apt install python3-pip or sudo yum install python3-pip
3. Install Ansible: pip3 install ansible
4. Verify the installation: ansible --version
5. Create the working directories: mkdir -p ~/ansible/{playbooks,inventory,roles}
6. Generate an SSH key pair: ssh-keygen -t rsa -b 4096 -f ~/.ssh/ansible_key
What to look for: The ansible --version command should display version information including "ansible [core 2.xx.x]" and Python version. SSH key generation should create two files: ansible_key and ansible_key.pub.
How to verify success: Run ansible localhost -m ping and receive "localhost | SUCCESS" with pong response.
If something goes wrong: If "ansible: command not found" appears, add pip's bin directory to PATH with export PATH=$PATH:~/.local/bin. If SSH key generation fails, ensure the .ssh directory exists with mkdir -p ~/.ssh && chmod 700 ~/.ssh.
Applies to version(s): All Ansible versions support INI format inventory; YAML format supported in 2.4+
What this does: Defines which hosts Ansible will manage and organizes them into groups for targeted automation tasks.
Prerequisites: Ansible control node installed, text editor access, knowledge of target host IP addresses or hostnames.
What to avoid: Do not include passwords in plain text inventory files. Avoid using production hostnames in test inventory files to prevent accidental execution against production systems.
GUI method:
CLI method (Bash):
1. Create the inventory directory: mkdir -p ~/ansible/inventory
2. Open the inventory file: nano ~/ansible/inventory/hosts
3. Add a group header such as [webservers] followed by host entries
4. Add each host as <hostname_or_ip> ansible_user=<username> under the appropriate group
5. Validate the inventory: ansible-inventory -i ~/ansible/inventory/hosts --list
6. Test connectivity: ansible -i ~/ansible/inventory/hosts all -m ping
What to look for: The ansible-inventory --list command should output JSON showing your defined groups and hosts. The ping test should return "SUCCESS" and "pong" for each reachable host.
How to verify success: Run ansible -i ~/ansible/inventory/hosts <group_name> --list-hosts and confirm all expected hosts appear in the output.
If something goes wrong: If "No hosts matched" appears, check inventory file syntax for missing brackets around group names or incorrect indentation. If SSH connection fails, verify the ansible_user has SSH key access with ssh -i ~/.ssh/ansible_key <ansible_user>@<host>.
Applies to version(s): YAML playbook format supported in all current Ansible versions
What this does: Creates automated task sequences that can be executed across multiple managed hosts for configuration management and deployment.
Prerequisites: Ansible control node configured, inventory file created, SSH access to target hosts established.
What to avoid: Do not use become: yes without specifying become_method and become_user in production environments. Avoid hardcoding sensitive values directly in playbook files.
GUI method:
CLI method (Bash):
1. Create the playbook directory: mkdir -p ~/ansible/playbooks
2. Open a new playbook file: nano ~/ansible/playbooks/basic-setup.yml
3. Start the file with --- on the first line, then - name: <playbook_description>
4. Add hosts: <group_name_or_all> with proper YAML indentation
5. Add tasks: followed by task definitions with - name: and module specifications
6. Check the syntax: ansible-playbook ~/ansible/playbooks/basic-setup.yml --syntax-check
7. Perform a dry run: ansible-playbook -i ~/ansible/inventory/hosts ~/ansible/playbooks/basic-setup.yml --check
8. Execute the playbook: ansible-playbook -i ~/ansible/inventory/hosts ~/ansible/playbooks/basic-setup.yml
What to look for: Syntax check should return "playbook: <filename>" with no errors. Dry-run mode shows "PLAY RECAP" with "changed=X" indicating what would be modified. Actual execution shows "ok", "changed", or "failed" status for each task.
How to verify success: Check the "PLAY RECAP" section shows zero failures and expected number of changed tasks. Run echo $? immediately after playbook execution to confirm exit code 0.
If something goes wrong: If YAML syntax errors appear, check indentation uses spaces not tabs and colons are followed by spaces. If "UNREACHABLE" status appears, verify SSH connectivity and that the ansible_user has appropriate permissions on target hosts.
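Put together, the structure described in this procedure might look like the sketch below for basic-setup.yml. It sets become_method and become_user explicitly, in line with the guidance above; the single task and its message-of-the-day content are placeholders, not part of the original procedure.

```yaml
---
- name: Basic host setup
  hosts: all
  become: yes
  become_method: sudo
  become_user: root
  tasks:
    - name: Set the message of the day
      copy:
        content: "Managed by Ansible\n"
        dest: /etc/motd
        mode: "0644"
```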
Applies to version(s): All Ansible versions require SSH access to managed hosts
What this does: Establishes passwordless SSH authentication between the Ansible control node and managed hosts for secure automated access.
Prerequisites: SSH key pair generated on control node, administrative access to target hosts, SSH service running on managed hosts.
What to avoid: Do not use the same SSH key for Ansible that is used for personal administrative access. Avoid copying private keys to multiple control nodes without proper key rotation procedures.
GUI method:
CLI method (Bash):
1. Copy the public key to the target: ssh-copy-id -i ~/.ssh/ansible_key.pub <username>@<target_host>
2. Test the key-based login: ssh -i ~/.ssh/ansible_key <username>@<target_host>
3. Close the test session: exit
4. Open the inventory file: nano ~/ansible/inventory/hosts
5. Add ansible_ssh_private_key_file=~/.ssh/ansible_key to host entries
6. Test through Ansible: ansible -i ~/ansible/inventory/hosts <target_host> -m ping
7. Load the key into the SSH agent if one is in use: ssh-add ~/.ssh/ansible_key
What to look for: The ssh-copy-id command should display "Number of key(s) added: 1". SSH login should not prompt for a password. Ansible ping should return "SUCCESS" and "pong" response.
How to verify success: Run ansible -i ~/ansible/inventory/hosts all -m setup -a "filter=ansible_hostname" and receive hostname facts from all managed hosts without password prompts.
If something goes wrong: If "Permission denied (publickey)" appears, verify the public key was added to the correct user's authorized_keys file with ssh <username>@<target_host> "cat ~/.ssh/authorized_keys". If SSH agent errors occur, start the agent with eval $(ssh-agent) before adding keys.
Applies to version(s): Ansible Vault available in Ansible 1.5+ with enhanced features in 2.4+
What this does: Encrypts sensitive data like passwords, API keys, and certificates within Ansible files to maintain security while enabling automation.
Prerequisites: Ansible installed, playbooks or variable files containing sensitive data, secure storage for vault passwords.
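What to avoid: Do not commit vault passwords to version control or leave them in files readable by other users; if a password file is used, keep it outside the repository and restrict it to its owner (chmod 600). Avoid using weak passwords for vault encryption or sharing vault passwords through insecure channels.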
GUI method:
CLI method (Bash):
1. Create an encrypted variable file: ansible-vault create ~/ansible/group_vars/all/vault.yml
2. Store the vault password in a file: echo "<vault_password>" > ~/.ansible_vault_pass
3. Restrict the file to its owner: chmod 600 ~/.ansible_vault_pass
4. View the encrypted file: ansible-vault view ~/ansible/group_vars/all/vault.yml --vault-password-file ~/.ansible_vault_pass
5. Edit the encrypted file: ansible-vault edit ~/ansible/group_vars/all/vault.yml --vault-password-file ~/.ansible_vault_pass
6. Run a playbook that uses it: ansible-playbook <playbook.yml> --vault-password-file ~/.ansible_vault_pass
What to look for: Encrypted files begin with "$ANSIBLE_VAULT;1.1;AES256" followed by encrypted content. The ansible-vault view command should display decrypted YAML content. Playbook execution should access vaulted variables without errors.
How to verify success: Run cat ~/ansible/group_vars/all/vault.yml to confirm content is encrypted, then verify variables are accessible in playbooks by using debug tasks to display non-sensitive vault variables.
If something goes wrong: If "Decryption failed" appears, verify the correct password is being used and the vault file is not corrupted. If "ERROR! Attempting to decrypt but no vault secrets found" occurs, ensure the --vault-password-file parameter is included in playbook execution commands.
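The header check described above can be scripted so that an unencrypted secrets file is caught before it is committed. This is a minimal sketch; the function name is_vault_encrypted is hypothetical, and it inspects only the first line of the file rather than validating the encrypted payload.

```shell
# Hypothetical helper: succeed when the first line of a file starts with
# the Ansible Vault header ($ANSIBLE_VAULT;...).
is_vault_encrypted() {
    head -n 1 "$1" 2>/dev/null | grep -q '^\$ANSIBLE_VAULT;'
}

# Example call (path taken from the steps above):
# if is_vault_encrypted ~/ansible/group_vars/all/vault.yml; then
#     echo "vault.yml is encrypted"
# else
#     echo "vault.yml is NOT encrypted"
# fi
```

A check like this does not replace ansible-vault view, which is still the way to confirm that the password actually decrypts the file.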
Applies to version(s): Service and package modules available across all current Ansible versions with OS-specific variations
What this does: Automates installation, configuration, and management of system packages and services across multiple hosts for consistent system state.
Prerequisites: Ansible control node configured, managed hosts accessible, sudo privileges configured for the ansible user on target systems.
What to avoid: ⚠️ WARNING Do not use state: absent on critical system packages without testing in non-production environments first. Avoid restarting services during business hours without proper change control approval.
GUI method:
CLI method (Bash):
nano ~/ansible/playbooks/service-management.yml
Add a task using the package: module with name: <package_name> and state: present
Add a task using the service: module with name: <service_name>, state: started, and enabled: yes
Set become: yes at play level for privilege escalation
ansible-playbook ~/ansible/playbooks/service-management.yml --syntax-check
ansible-playbook -i ~/ansible/inventory/hosts ~/ansible/playbooks/service-management.yml --check
ansible-playbook -i ~/ansible/inventory/hosts ~/ansible/playbooks/service-management.yml
ansible -i ~/ansible/inventory/hosts all -m service -a "name=<service_name> state=started" --check --become
What to look for: Package installation shows "changed" status when installing new packages or "ok" when already present. Service tasks display "changed" when starting stopped services or "ok" when already running. The final verification command runs in check mode because the service module needs a state or enabled argument; it reports "changed": false when the service is already running.
How to verify success: Run ansible -i ~/ansible/inventory/hosts all -m shell -a "systemctl is-active <service_name>" --become and confirm "active" status returned from all hosts.
If something goes wrong: If "BECOME password required" appears, add ansible_become_pass to inventory or use --ask-become-pass flag. If package installation fails with "No package matching" error, verify the package name is correct for the target OS distribution using the appropriate package module (apt, yum, dnf).
Applies to version(s): Setup module available in all Ansible versions with expanded fact collection in 2.0+
What this does: Gathers comprehensive system information from managed hosts for inventory management, compliance reporting, and troubleshooting purposes.
Prerequisites: Ansible control node configured, SSH access to managed hosts, sufficient disk space for fact output storage.
What to avoid: Do not collect facts from large numbers of hosts simultaneously without rate limiting, as this can overwhelm network resources. Avoid storing fact output in version control due to sensitive system information.
GUI method:
CLI method (Bash):
ansible -i ~/ansible/inventory/hosts all -m setup
ansible -i ~/ansible/inventory/hosts all -m setup -a "filter=ansible_distribution*"
ansible -i ~/ansible/inventory/hosts all -m setup --tree ~/ansible/facts/
nano ~/ansible/playbooks/fact-report.yml
Add gather_facts: yes and debug tasks to display specific facts
ansible-playbook -i ~/ansible/inventory/hosts ~/ansible/playbooks/fact-report.yml
ansible -i ~/ansible/inventory/hosts all -m setup -a "filter=ansible_hostname,ansible_distribution,ansible_memtotal_mb" | grep -E "(ansible_hostname|ansible_distribution|ansible_memtotal_mb)" > ~/ansible/system-report.txt
What to look for: Fact collection returns JSON-formatted data with "ansible_facts" containing system information. The --tree option creates individual JSON files named by hostname. Filtered facts show only requested information categories.
How to verify success: Check that fact files exist in the specified directory with ls -la ~/ansible/facts/ and verify JSON content is valid with python3 -m json.tool ~/ansible/facts/<hostname>.
If something goes wrong: If "Permission denied" errors occur during fact collection, verify the ansible user has read access to system files like /proc/meminfo and /etc/os-release. If fact gathering times out, increase the timeout value with -T 30 parameter or reduce the number of target hosts per execution.
Applies to version(s): Jinja2 templating available in all current Ansible versions
What this does: Creates dynamic configuration files using templates that incorporate host-specific variables and facts for consistent yet customized deployments.
Prerequisites: Ansible control node configured, template files created, target directories writable by ansible user, backup strategy for existing configuration files.
What to avoid: ⚠️ WARNING Do not deploy templates to production configuration files without testing and backup procedures. Avoid using undefined variables in templates as this will cause deployment failures.
GUI method:
CLI method (Bash):
mkdir -p ~/ansible/templates
nano ~/ansible/templates/config.conf.j2
Add template content using variables such as {{ ansible_hostname }} and {{ custom_variable }}
nano ~/ansible/group_vars/all/main.yml and add variable definitions
nano ~/ansible/playbooks/deploy-config.yml
Add a task using the template: module with src: config.conf.j2, dest: /path/to/config.conf, and backup: yes
ansible -i ~/ansible/inventory/hosts <host> -m template -a "src=~/ansible/templates/config.conf.j2 dest=/tmp/test-config.conf" --check
ansible-playbook -i ~/ansible/inventory/hosts ~/ansible/playbooks/deploy-config.yml
What to look for: Template task shows "changed" status when deploying new or modified templates. The backup parameter saves a timestamped copy of the previous file next to the destination. Check mode reports whether the file would change; add --diff to display the rendered template differences.
How to verify success: Run ansible -i ~/ansible/inventory/hosts all -m shell -a "cat /path/to/config.conf" to verify template variables were properly substituted with host-specific values.
If something goes wrong: If "AnsibleUndefinedVariable" errors appear, check that all template variables are defined in group_vars, host_vars, or playbook vars sections. If template deployment fails with permission errors, verify the destination directory exists and the ansible user has write permissions with appropriate become privileges.
Applies to version(s): Ad-hoc command functionality available in all Ansible versions
What this does: Runs immediate commands across multiple hosts for quick troubleshooting, system checks, and emergency response without creating formal playbooks.
Prerequisites: Ansible control node configured, SSH access to target hosts, appropriate privileges for commands being executed.
What to avoid: ⚠️ WARNING Do not execute destructive commands like rm, mkfs, or service stops without explicit approval. Avoid running commands that require interactive input as they will hang indefinitely.
GUI method:
CLI method (Bash):
ansible -i ~/ansible/inventory/hosts all -m shell -a "uptime"
ansible -i ~/ansible/inventory/hosts all -m shell -a "df -h"
ansible -i ~/ansible/inventory/hosts all -m service -a "name=<service_name> state=started" --check --become
ansible -i ~/ansible/inventory/hosts all -m setup -a "filter=ansible_memory_mb"
ansible -i ~/ansible/inventory/hosts all -m copy -a "src=/local/file dest=/remote/path"
ansible -i ~/ansible/inventory/hosts all -m shell -a "systemctl status <service>" --become
ansible -i ~/ansible/inventory/hosts <group_name> -m ping
ansible -i ~/ansible/inventory/hosts all -m shell -a "long-running-command" -T 60
What to look for: Successful commands return "SUCCESS" status (or "CHANGED" for the shell and command modules) with command output. Failed commands show "FAILED" status with error messages. Unreachable hosts display "UNREACHABLE" with connection details. The service command runs in check mode with state=started because the service module needs a state or enabled argument.
How to verify success: Check that all expected hosts respond with "SUCCESS" status and review command output for expected results. Use echo $? to verify the ansible command itself completed with exit code 0.
If something goes wrong: If commands time out, increase the connection timeout with -T <seconds> or break complex commands into smaller operations. If "MODULE FAILURE" appears, verify the module name is correct and the target hosts have required dependencies installed (like python for shell module).
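When an ad-hoc command targets many hosts, counting the status markers is quicker than reading every result. The sketch below tallies the per-host status lines from ad-hoc output piped into it. The function name summarize_adhoc is hypothetical, and the sketch assumes the default output format, in which each host result begins with "<host> | STATUS".

```shell
# Hypothetical helper: count per-host result lines in ad-hoc output read
# from standard input. CHANGED is counted as success because the shell and
# command modules report it for commands that ran without error.
summarize_adhoc() {
    awk '
        /^[^ ]+ [|] (SUCCESS|CHANGED)/ { ok++ }
        /^[^ ]+ [|] FAILED/            { failed++ }
        /^[^ ]+ [|] UNREACHABLE/       { unreachable++ }
        END { printf "ok=%d failed=%d unreachable=%d\n", ok, failed, unreachable }
    '
}

# Example call:
# ansible -i ~/ansible/inventory/hosts all -m ping | summarize_adhoc
```

The exit code of the ansible command remains the authoritative pass/fail signal; the tally only shows how the hosts were distributed.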
Applies to version(s): Logging functionality available in all Ansible versions with enhanced options in 2.0+
What this does: Configures comprehensive logging and monitors Ansible execution for troubleshooting, compliance auditing, and performance analysis.
Prerequisites: Ansible control node configured, write permissions to log directories, log rotation tools available for long-term log management.
What to avoid: Do not log to directories without sufficient disk space as this can fill filesystems. Avoid logging sensitive data like passwords or API keys in verbose mode output.
GUI method:
CLI method (Bash):
mkdir -p ~/ansible/logs
export ANSIBLE_LOG_PATH=~/ansible/logs/ansible.log
export ANSIBLE_DEBUG=True
echo "export ANSIBLE_LOG_PATH=~/ansible/logs/ansible.log" >> ~/.bashrc
ansible-playbook -i ~/ansible/inventory/hosts ~/ansible/playbooks/<playbook.yml> -v
tail -f ~/ansible/logs/ansible.log
grep -i "failed\|error" ~/ansible/logs/ansible.log
grep "PLAY RECAP" ~/ansible/logs/ansible.log
logrotate -f ~/ansible/logrotate.conf (after creating appropriate logrotate configuration)
What to look for: Log entries include timestamps, log levels (DEBUG, INFO, WARNING, ERROR), and detailed execution information. Failed tasks appear with "FAILED" status and error details. Successful completions show "PLAY RECAP" with execution statistics.
How to verify success: Confirm log file exists and contains recent entries with ls -la ~/ansible/logs/ and tail ~/ansible/logs/ansible.log. Verify log rotation prevents excessive disk usage.
If something goes wrong: If no logs appear, verify the ANSIBLE_LOG_PATH directory exists and is writable with touch ~/ansible/logs/test.log. If logs contain permission errors, check that the ansible user has appropriate access to create files in the specified log directory and consider using sudo for log directory creation.
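The writability check above can be wrapped in a small function and run before a playbook starts, so that a missing log directory is noticed up front. This is a sketch only; the function name log_path_writable is hypothetical.

```shell
# Hypothetical helper: succeed when the directory that would hold the given
# log file exists and is writable by the current user.
log_path_writable() {
    log_file="$1"
    log_dir=$(dirname "$log_file")
    [ -d "$log_dir" ] && [ -w "$log_dir" ]
}

# Example call (path taken from the steps above):
# if ! log_path_writable ~/ansible/logs/ansible.log; then
#     echo "log directory missing or not writable"
# fi
```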
Applies to version(s): Ansible 2.9 through 6.x (ansible-core 2.12-2.13 for the 5.x and 6.x community packages)
What this does: Sets up the primary Ansible control node from which all automation tasks will be executed and managed.
Prerequisites: Linux system with Python 3.8+ installed, sudo access, network connectivity to target hosts.
What to avoid: Do not install Ansible on Windows as a control node - it is not supported. Do not use Python 2.7 as it is deprecated and will cause compatibility issues.
GUI method:
CLI method (Bash):
sudo apt update && sudo apt upgrade -y on Ubuntu/Debian or sudo yum update -y on RHEL/CentOS
sudo apt install python3-pip -y or sudo yum install python3-pip -y
pip3 install ansible
ansible --version
mkdir -p ~/ansible/{playbooks,inventory,roles}
What to look for: The ansible --version command should display version information including ansible-core version, config file location, and Python version. Directory creation should complete without errors.
How to verify success: Run ansible localhost -m ping and receive a successful pong response with "changed": false status.
If something goes wrong: If pip installation fails, install using package manager with sudo apt install ansible. If Python version conflicts occur, use python3 -m pip install --user ansible to install in user space.
Applies to version(s): All Ansible versions support INI format inventory; YAML format supported in 2.4+
What this does: Defines target hosts and groups that Ansible will manage, enabling organized automation across infrastructure.
Prerequisites: Ansible control node installed, text editor access, knowledge of target host IP addresses or hostnames.
What to avoid: Do not include passwords in plain text inventory files. Do not use spaces in group names as this causes parsing errors.
GUI method:
CLI method (Bash):
nano ~/ansible/inventory/hosts
Add a group header such as [webservers] followed by host entries
Write each host entry as <hostname_or_ip> ansible_user=<username>
Create [webservers:vars] and add common variables
ansible-inventory -i ~/ansible/inventory/hosts --list
What to look for: The ansible-inventory --list command should output JSON format showing all hosts organized by groups with no parsing errors.
How to verify success: Run ansible all -i ~/ansible/inventory/hosts --list-hosts to see all managed hosts listed correctly.
If something goes wrong: If parsing fails, check for missing brackets around group names in INI inventories or invalid syntax in YAML inventories. If hosts are unreachable, verify SSH connectivity with ssh <username>@<hostname> manually.
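The warning about spaces in group names can be checked mechanically before the inventory is handed to Ansible. The sketch below scans an INI inventory for bracketed group headers that contain a space. The function name find_bad_group_names is hypothetical, and the check covers only this one mistake, not full inventory validation.

```shell
# Hypothetical helper: print (with line numbers) every INI group header in
# the given inventory file whose name contains a space. Exits non-zero when
# no such header is found, following grep conventions.
find_bad_group_names() {
    grep -n '^\[[^]]* [^]]*]' "$1"
}

# Example call (path taken from the steps above):
# find_bad_group_names ~/ansible/inventory/hosts
```

ansible-inventory --list remains the full parse test; this check just points at the offending line faster.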
Applies to version(s): All Ansible versions - SSH is the default connection method
What this does: Establishes passwordless SSH authentication from control node to managed hosts for secure automated connections.
Prerequisites: SSH client installed on control node, user accounts on target hosts, network connectivity on port 22.
What to avoid: Do not disable SSH host key checking globally in production - this creates security vulnerabilities. Do not use weak SSH key algorithms like DSA.
GUI method:
CLI method (Bash):
ssh-keygen -t rsa -b 4096 -C "ansible-control-node"
Accept the default key location ~/.ssh/id_rsa
ssh-copy-id <username>@<target_host>
ssh <username>@<target_host>
What to look for: SSH key generation should display key fingerprint and randomart image. ssh-copy-id should show "Number of key(s) added: 1" message.
How to verify success: SSH connection should complete without password prompt, and ansible <target_host> -m ping should return successful pong response.
If something goes wrong: If ssh-copy-id fails, manually append public key content to target host's ~/.ssh/authorized_keys file. If connection is refused, verify SSH service is running with sudo systemctl status ssh (sshd on RHEL-based systems) on target host.
Applies to version(s): YAML playbook format supported in all modern Ansible versions (2.0+)
What this does: Creates reusable automation scripts that define desired system state and execute tasks across managed infrastructure.
Prerequisites: Ansible installed, inventory configured, SSH authentication working, basic YAML syntax knowledge.
What to avoid: Do not use tabs for indentation in YAML files - use spaces only. Do not run playbooks with --check mode in production without understanding module limitations.
GUI method:
CLI method (Bash):
nano ~/ansible/playbooks/basic-setup.yml
Add --- on first line, then - name: Basic System Setup
Add hosts: all and become: yes for sudo privileges
Add tasks: followed by indented task definitions
ansible-playbook -i ~/ansible/inventory/hosts ~/ansible/playbooks/basic-setup.yml
What to look for: Playbook execution should show "PLAY RECAP" summary with ok/changed/unreachable/failed counts for each host. Tasks should display "ok" or "changed" status.
How to verify success: All hosts in PLAY RECAP should show 0 unreachable and 0 failed tasks. Run playbook again to verify idempotency with "changed=0" results.
If something goes wrong: If YAML syntax errors occur, use ansible-playbook --syntax-check <playbook.yml> to validate. If tasks fail, add -vvv flag for verbose debugging output.
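The PLAY RECAP check can be automated for runs that cover many hosts. The sketch below reads playbook output and prints only the hosts whose recap line reports a non-zero unreachable or failed count. The function name recap_problem_hosts is hypothetical, and the sketch assumes terminal output in which each recap line has the form "<host> : ok=N changed=N unreachable=N failed=N ...".

```shell
# Hypothetical helper: read ansible-playbook output on standard input and
# print each host whose PLAY RECAP line has unreachable > 0 or failed > 0.
recap_problem_hosts() {
    awk '
        / : ok=[0-9]+/ {
            bad = 0
            for (i = 2; i <= NF; i++) {
                split($i, kv, "=")
                if ((kv[1] == "unreachable" || kv[1] == "failed") && kv[2] + 0 > 0)
                    bad = 1
            }
            if (bad) print $1
        }
    '
}

# Example call:
# ansible-playbook -i ~/ansible/inventory/hosts ~/ansible/playbooks/basic-setup.yml | recap_problem_hosts
```

An empty result means every host finished with 0 unreachable and 0 failed, which matches the success criterion above.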
Applies to version(s): Ansible Vault available in Ansible 1.5+ with AES256 encryption
What this does: Encrypts sensitive data like passwords and API keys within Ansible files to maintain security while enabling automation.
Prerequisites: Ansible installed, playbooks or variable files containing sensitive data, secure password management.
What to avoid: Do not store vault passwords in plain text files or version control. Do not use weak passwords for vault encryption.
GUI method:
CLI method (Bash):
ansible-vault create ~/ansible/vault/secrets.yml
Reference the vault file in the playbook with vars_files: ~/ansible/vault/secrets.yml
ansible-playbook --ask-vault-pass <playbook.yml>
What to look for: Encrypted vault files should contain $ANSIBLE_VAULT;1.1;AES256 header followed by encrypted content. Playbook execution should prompt for vault password.
How to verify success: Run ansible-vault view ~/ansible/vault/secrets.yml and successfully decrypt content with correct password. Playbook should execute without exposing sensitive values in output.
If something goes wrong: If password is forgotten, vault files cannot be recovered - maintain secure password backup. If decryption fails, verify file integrity with ansible-vault view command.
Applies to version(s): Package modules available across all Ansible versions with distribution-specific modules
What this does: Automates software installation, updates, and removal across multiple systems using appropriate package managers.
Prerequisites: Target hosts accessible, appropriate package manager available (apt, yum, dnf), sudo privileges configured.
What to avoid: ⚠️ WARNING Do not use state: latest in production without change control approval as this can cause unexpected updates. Do not mix package managers on the same system.
GUI method:
CLI method (Bash):
nano ~/ansible/playbooks/package-management.yml
For Debian/Ubuntu hosts, use the apt module with name: <package_name> and state: present
For RHEL-based hosts, use the yum or dnf module with same parameters
Add update_cache: yes for apt or update_cache: true for yum
ansible-playbook -i <inventory> ~/ansible/playbooks/package-management.yml
What to look for: Tasks should show "changed" status when packages are installed or "ok" when already present. Package cache updates should complete successfully.
How to verify success: Run ansible all -m shell -a "which <package_command>" to verify package installation, or check with distribution-specific commands like dpkg -l <package>.
If something goes wrong: If package not found errors occur, verify package names are correct for target distribution. If permission denied, ensure become: yes is set in playbook and sudo access is configured.
Applies to version(s): Service module available in all Ansible versions with systemd support in 2.2+
What this does: Automates starting, stopping, enabling, and disabling system services across managed infrastructure.
Prerequisites: Target systems with systemd or init system, sudo privileges, services installed on target hosts.
What to avoid: ⚠️ WARNING Do not stop critical services like SSH or networking without console access to target systems. Do not use state: restarted on production services without change approval.
GUI method:
CLI method (Bash):
nano ~/ansible/playbooks/service-management.yml
Add a task using the service module with name: <service_name> parameter
Set state: started, stopped, or restarted as required
Set enabled: yes to start service at boot or enabled: no to disable
ansible-playbook -i <inventory> ~/ansible/playbooks/service-management.yml
What to look for: Service tasks should show "changed" when service state is modified or "ok" when already in desired state. No error messages about service not found.
How to verify success: Run ansible all -m shell -a "systemctl status <service_name>" to verify service status matches desired configuration.
If something goes wrong: If service not found errors occur, verify service name spelling and that service is installed. If permission errors occur, ensure become: yes is configured and user has sudo access.
Applies to version(s): Setup module and fact gathering available in all Ansible versions
What this does: Gathers detailed system information from managed hosts for inventory, compliance reporting, and conditional task execution.
Prerequisites: Ansible control node configured, target hosts accessible via SSH, basic inventory file created.
What to avoid: Do not disable fact gathering globally with gather_facts: no unless specifically needed for performance, as many modules depend on system facts.
GUI method:
CLI method (Bash):
ansible <hostname> -m setup
ansible <hostname> -m setup -a "filter=ansible_os_family"
Add gather_facts: yes to playbook header (enabled by default)
ansible <hostname> -m setup --tree ~/ansible/facts/
Reference facts in playbooks with {{ ansible_hostname }} or {{ ansible_distribution }}
What to look for: Setup module should return JSON output containing system information like OS version, IP addresses, memory, and disk space. No connection or permission errors.
How to verify success: Verify specific facts are collected correctly by running ansible <hostname> -m setup -a "filter=ansible_hostname" and confirming output matches expected system hostname.
If something goes wrong: If fact gathering fails, check SSH connectivity and Python installation on target host. If specific facts are missing, verify the target system supports that information type.
Applies to version(s): Copy, template, and file modules available in all Ansible versions with Jinja2 templating
What this does: Manages configuration files, copies static files, and generates dynamic content using templates across managed systems.
Prerequisites: Source files or templates available on control node, target directory permissions configured, backup strategy for modified files.
What to avoid: ⚠️ WARNING Do not overwrite critical system files without backup enabled using backup: yes. Do not use templates for binary files - use copy module instead.
GUI method:
CLI method (Bash):
nano ~/ansible/playbooks/file-management.yml
Add a task using the copy module with src: <local_file> and dest: <remote_path>
Set mode: '0644', owner: <username>, and group: <groupname>
Add a task using the template module with src: <template.j2> and dest: <remote_path>
Add backup: yes to preserve original files
What to look for: File tasks should show "changed" when files are modified or "ok" when already correct. Template tasks should process Jinja2 variables successfully.
How to verify success: Run ansible all -m shell -a "ls -la <target_file>" to verify file exists with correct permissions, or use stat module to check file properties.
If something goes wrong: If permission denied errors occur, verify target directory exists and user has write access. If template errors occur, check Jinja2 syntax and variable definitions in playbook.
Applies to version(s): Logging and debugging features available across all Ansible versions with enhancements in 2.5+
What this does: Provides visibility into playbook execution, identifies failures, and collects diagnostic information for troubleshooting automation issues.
Prerequisites: Ansible playbooks created, log file permissions configured, understanding of Ansible output formats.
What to avoid: Do not use maximum verbosity (-vvvv) in production as it may expose sensitive information in logs. Do not ignore unreachable hosts without investigating connectivity issues.
GUI method:
CLI method (Bash):
ansible-playbook -vvv <playbook.yml>
ansible-playbook --syntax-check <playbook.yml>
ansible-playbook --check <playbook.yml>
export ANSIBLE_LOG_PATH=~/ansible/logs/ansible.log
ansible-playbook --limit <hostname> <playbook.yml>
What to look for: Verbose output should show SSH connections, module execution details, and variable values. Syntax check should print "playbook: <filename>" with no error messages, or report specific error locations.
How to verify success: PLAY RECAP should show all hosts with 0 unreachable and 0 failed tasks. Log files should contain detailed execution information without error messages.
If something goes wrong: If tasks fail intermittently, check network connectivity and SSH key authentication. If modules report errors, verify target system has required dependencies and permissions for the specific module operations.
Playbooks are YAML files that define a series of tasks to be executed on target hosts. They represent the core automation workflows in Ansible, combining tasks, variables, handlers, and roles into executable automation scenarios.
Basic playbook structure includes:
Playbooks for standardizing system configurations across multiple servers:
---
- name: Configure web servers
  hosts: webservers
  become: yes
  tasks:
    - name: Install Apache
      package:
        name: httpd
        state: present
    - name: Start Apache service
      service:
        name: httpd
        state: started
        enabled: yes
Automated deployment workflows that handle code updates, service restarts, and validation:
---
- name: Deploy application
  hosts: app_servers
  vars:
    app_version: "{{ version | default('latest') }}"
  tasks:
    - name: Stop application service
      service:
        name: myapp
        state: stopped
    - name: Deploy new version
      copy:
        src: "/builds/myapp-{{ app_version }}.jar"
        dest: "/opt/myapp/myapp.jar"
      notify: restart application
  # The notify above needs a handler with the same name to take effect
  handlers:
    - name: restart application
      service:
        name: myapp
        state: restarted
Playbooks that implement security policies and compliance requirements:
---
- name: Security hardening
  hosts: all
  become: yes
  tasks:
    - name: Update all packages
      package:
        name: "*"
        state: latest
    - name: Configure firewall rules
      firewalld:
        service: ssh
        permanent: yes
        state: enabled
        immediate: yes
Orchestrating complex deployments across multiple environments:
Using when conditions and blocks for environment-specific tasks:
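# "environment" is a reserved keyword in Ansible, so the condition uses
# deploy_environment (an illustrative variable name) instead
- name: Configure development settings
  template:
    src: dev-config.j2
    dest: /etc/myapp/config.yml
  when: deploy_environment == "development"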
Implementing robust error handling in automation workflows:
- block:
    - name: Risky operation
      command: /usr/local/bin/risky-command
  rescue:
    - name: Handle failure
      debug:
        msg: "Operation failed, initiating recovery"
    - name: Recovery action
      service:
        name: backup-service
        state: started
Key execution parameters for different scenarios:
Situation: Critical security vulnerability requires immediate patching across all systems.
Workflow:
What would you do if 10% of systems fail the patch installation?
Answer: Immediately stop the rolling update, isolate failed systems, analyze failure logs, and determine if rollback is necessary while investigating the root cause.
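The 10% figure in the question can be turned into a simple stop rule for a wrapper script. The sketch below computes the failure percentage from a failed-host count and a total-host count and compares it with a threshold. The function names failure_percent and should_halt are hypothetical, and the percentage is rounded down by integer division.

```shell
# Hypothetical helpers for a rolling-update stop rule.

# failure_percent FAILED TOTAL: print the failure rate as a whole percentage
# (integer division, rounded down). TOTAL must be greater than zero.
failure_percent() {
    echo $(( $1 * 100 / $2 ))
}

# should_halt FAILED TOTAL THRESHOLD: succeed when the failure rate has
# reached the threshold percentage.
should_halt() {
    [ "$2" -gt 0 ] && [ "$(failure_percent "$1" "$2")" -ge "$3" ]
}

# Example: 5 failures out of 50 hosts against a 10% threshold
# if should_halt 5 50 10; then echo "halt the rollout and escalate"; fi
```

Inside a playbook, the max_fail_percentage play keyword combined with serial applies a comparable rule per batch without an external script.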
Situation: Scheduled maintenance requires coordinated shutdown of application tiers and database operations.
Workflow:
Escalate to Tier 2 when:
Prevention: Always test playbooks in development environments and use --check mode before production execution.
Prevention: Implement proper rescue blocks and failure conditions for critical tasks that could impact system availability.
Prevention: Use variables and templates to make playbooks reusable across different environments and configurations.
Objective: Verify playbook syntax, connectivity, and prerequisites before executing automation tasks in production environments.
Prerequisites:
Syntax Validation Steps:
ansible-playbook --syntax-check playbook.yml
ansible-inventory --list
Connectivity Testing:
ansible all -m ping
ansible webservers -m ping
ansible all -m setup --become
Expected Results:
Objective: Execute playbooks in check mode to preview changes without modifying target systems.
Check Mode Execution:
ansible-playbook --check playbook.yml
ansible-playbook --check --diff playbook.yml
Limited Scope Testing:
ansible-playbook --limit hostname playbook.yml --check
ansible-playbook --limit webservers playbook.yml --check
ansible-playbook --tags specific_tag playbook.yml --check
What would you do? A dry run shows files being deleted that should remain.
Answer: Stop execution, review playbook logic, check conditionals and file paths, validate against requirements before proceeding.
Objective: Execute full playbook runs in non-production environments that mirror production configurations.
Test Environment Validation:
ansible-playbook -i dev_inventory playbook.yml
Service Validation Steps:
ansible all -m service -a "name=httpd state=started" --check
ansible all -m wait_for -a "port=80 timeout=10"
Objective: Safely validate automation results in production environments with minimal risk.
Phased Deployment Testing:
ansible-playbook --limit "webservers[0:2]" -v playbook.yml
Post-Execution Validation:
Insufficient Testing Scope:
Environment Mismatches:
Tier 1 Capabilities:
Escalate to Tier 2 When:
Escalate to Tier 3 When:
Objective: Systematically identify and resolve Ansible automation issues using a structured decision-tree approach.
Prerequisites: Access to Ansible control node, playbook files, and target system logs. Basic understanding of Ansible concepts covered in earlier sections.
Start Here: What type of failure are you experiencing?
Symptom: Ansible commands fail to start or produce "command not found" errors.
Decision Path:
ansible --version
ls -la /path/to/inventory
Tier 1 Actions: Verify file paths, check basic permissions, validate command syntax
Escalate to Tier 2: Installation issues, complex permission problems, environment configuration
Symptom: "UNREACHABLE" errors, SSH failures, or authentication timeouts.
Decision Path:
ping target_hostname
ssh username@target_hostname
ansible target_host -m ping
Common Resolution Steps:
Tier 1 Actions: Basic connectivity tests, inventory verification, SSH key validation
Escalate to Tier 2: Network configuration, firewall rules, SSH service configuration, privilege escalation setup
Symptom: YAML parsing errors, "syntax error" messages, playbook won't start.
Decision Path:
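ansible-playbook --syntax-check playbook.yml
ansible-doc module_name
What would you do? You encounter this error: "ERROR! 'become_user' is not a valid attribute for a Play"
Answer: The message names the level at which Ansible read the key, so start with indentation. Confirm that 'become_user' sits exactly where intended, either aligned with 'hosts' and 'tasks' to apply to the whole play or aligned with the module name inside a single task. Then check the key for typos or stray characters and rerun the syntax check.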
Tier 1 Actions: Syntax validation, basic YAML fixes, parameter verification
Escalate to Tier 2: Complex playbook restructuring, custom module issues, advanced templating problems
Symptom: Playbook starts but individual tasks fail with "FAILED" status.
Decision Path:
ansible-playbook -b playbook.yml (if appropriate)
Validation Steps:
ansible-playbook -vvv playbook.yml
ansible target_host -m module_name -a "parameters"
Tier 1 Actions: Read error messages, check basic permissions, verify simple variables
Escalate to Tier 2: Complex permission issues, system configuration problems, advanced templating, custom facts
Symptom: Playbooks run slowly, timeout errors, or hang indefinitely.
Decision Path:
ansible-playbook --forks=5 playbook.yml
Tier 1 Actions: Basic performance monitoring, adjust simple settings like forks
Escalate to Tier 2: Network optimization, system resource issues, complex performance tuning
Immediately escalate when encountering:
Expected Result: Issue identified and either resolved at Tier 1 level or properly escalated with complete diagnostic information.
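The first branch of the decision tree, sorting a failure by its symptom text, can be sketched as a lookup. The function name classify_failure and the category labels are hypothetical; the matched strings are the symptoms listed in the sections above, and real error text may need additional patterns.

```shell
# Hypothetical helper: map an error message to the decision-tree category
# it most likely belongs to, based on the symptom strings in this section.
classify_failure() {
    case "$1" in
        *"command not found"*)      echo "execution" ;;
        *UNREACHABLE*)              echo "connectivity" ;;
        *"syntax error"*|*ERROR!*)  echo "syntax" ;;
        *FAILED*)                   echo "task" ;;
        *)                          echo "unknown" ;;
    esac
}

# Example call:
# classify_failure "web1 | UNREACHABLE! => {"
```

Messages that match no pattern fall through to "unknown", which is a reasonable point to collect diagnostics and escalate rather than guess.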
Before installing Ansible, verify your environment meets these minimum requirements:
Install these packages before Ansible installation:
# Ubuntu/Debian
sudo apt update
sudo apt install python3 python3-pip openssh-client
# RHEL/CentOS/Fedora
sudo dnf install python3 python3-pip openssh-clients
# macOS
brew install python3
Target systems must have:
Configure passwordless SSH access for automation:
# Generate SSH key pair on control node
ssh-keygen -t rsa -b 4096 -C "ansible-automation"
# Copy public key to managed nodes
ssh-copy-id username@target-host
# Test connectivity
ssh username@target-host "echo 'SSH connection successful'"
Install required Python libraries:
# Essential packages
pip3 install --user ansible-core
pip3 install --user paramiko # SSH connections
pip3 install --user PyYAML # YAML parsing
# Optional but recommended
pip3 install --user jinja2 # Template engine
pip3 install --user cryptography # Vault encryption
Ensure network connectivity:
Configure sudo access for automation tasks:
# Add user to sudoers with NOPASSWD (on managed nodes)
echo "ansible-user ALL=(ALL) NOPASSWD:ALL" | sudo tee /etc/sudoers.d/ansible-user
# Verify sudo access
sudo -l
Verify prerequisites before proceeding:
python3 --version shows 3.8+
Watch for these frequent problems:
Tier 1 Responsibilities:
Escalate to Tier 2 when:
Install and configure Ansible on control nodes to manage infrastructure automation. This section covers installation methods, initial configuration, and deployment verification.
Install using distribution package managers for stable, supported versions.
Red Hat/CentOS/Fedora:
sudo dnf install ansible-core
# or for older systems
sudo yum install ansible
Ubuntu/Debian:
sudo apt update
sudo apt install ansible
Install latest version using Python package manager. Requires Tier 2 approval for production systems.
pip3 install ansible
# or for user-specific installation
pip3 install --user ansible
Create or modify ansible.cfg in project directory or /etc/ansible/ansible.cfg:
[defaults]
inventory = ./inventory
host_key_checking = False
remote_user = ansible
private_key_file = ~/.ssh/ansible_key
timeout = 30
[privilege_escalation]
become = True
become_method = sudo
become_user = root
Create inventory file listing managed nodes:
[webservers]
web1.example.com
web2.example.com
[databases]
db1.example.com ansible_host=192.168.1.100
db2.example.com ansible_host=192.168.1.101
[production:children]
webservers
databases
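The group layout above can be checked programmatically before it is used. The following is a minimal Python sketch, not Ansible's own inventory parser: the function name parse_ini_inventory is hypothetical, and the sketch deliberately ignores [group:vars] sections, host ranges, and inline variables such as ansible_host.

```python
from collections import defaultdict


def parse_ini_inventory(text: str) -> dict:
    """Return {section_name: [first token of each entry]} for a simple INI inventory.

    Simplified on purpose: section headers are kept verbatim (including any
    ":children" suffix) and only the host or child-group name is recorded.
    """
    groups = defaultdict(list)
    current = "ungrouped"
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            continue
        if current.endswith(":vars"):
            continue  # variable sections are out of scope for this sketch
        groups[current].append(line.split()[0])
    return dict(groups)


SAMPLE_INVENTORY = """\
[webservers]
web1.example.com
web2.example.com
[databases]
db1.example.com ansible_host=192.168.1.100
db2.example.com ansible_host=192.168.1.101
[production:children]
webservers
databases
"""

inventory_groups = parse_ini_inventory(SAMPLE_INVENTORY)
for section, members in inventory_groups.items():
    print(section, members)
```

For authoritative results, compare against the output of ansible-inventory --list, which uses Ansible's real parser.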
ssh-keygen -t rsa -b 4096 -f ~/.ssh/ansible_key
# Do not set passphrase for automation use
ssh-copy-id -i ~/.ssh/ansible_key.pub user@target-host
# Repeat for all managed nodes
ansible --version
ansible-playbook --version
Expected Result: Version information displays without errors, showing ansible-core version and Python version.
ansible all -m ping
ansible all -m setup --limit 1
Expected Result: All hosts return "pong" response and system facts display for test host.
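When many hosts are involved, the ping output is easier to review as a summary. The sketch below, in Python, groups hosts by the status shown on each host's header line. It assumes the default ad-hoc output format ("hostname | STATUS => {"); other callback plugins or verbosity levels may print differently, and the sample text and the name summarize_ping are illustrative only.

```python
import re

# Header line printed per host by the default ad-hoc output,
# e.g. "web1.example.com | SUCCESS => {" or "db1 | UNREACHABLE! => {".
HOST_LINE = re.compile(r"^(?P<host>\S+) \| (?P<status>[A-Z]+)!? ")


def summarize_ping(output: str) -> dict:
    """Group host names by the status reported on their header line."""
    summary = {}
    for line in output.splitlines():
        match = HOST_LINE.match(line)
        if match:
            summary.setdefault(match["status"], []).append(match["host"])
    return summary


SAMPLE_OUTPUT = """\
web1.example.com | SUCCESS => {
    "changed": false,
    "ping": "pong"
}
db1.example.com | UNREACHABLE! => {
    "changed": false,
    "msg": "Failed to connect to the host via ssh",
    "unreachable": true
}
"""

ping_summary = summarize_ping(SAMPLE_OUTPUT)
print(ping_summary)
```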
ansible all -m command -a "whoami" --become
Expected Result: Returns "root" for all managed nodes.
Symptom: ImportError or module not found errors
Tier 1 Action: Verify Python version with python3 --version. If below 3.8, escalate to Tier 2.
Resolution: Update Python or use virtual environment with correct version.
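The version check above can be scripted so the escalation decision is consistent. This is a small Python sketch under the assumption that the input is the text printed by python3 --version (for example "Python 3.10.12"); the helper names are hypothetical.

```python
import re

MINIMUM = (3, 8)  # threshold taken from the Tier 1 action above


def parse_python_version(output: str) -> tuple:
    """Extract (major, minor) from text such as 'Python 3.10.12'."""
    match = re.search(r"Python (\d+)\.(\d+)", output)
    if match is None:
        raise ValueError(f"unrecognized version output: {output!r}")
    return int(match.group(1)), int(match.group(2))


def needs_escalation(output: str, minimum: tuple = MINIMUM) -> bool:
    """True when the reported interpreter is older than the minimum."""
    return parse_python_version(output) < minimum


for sample in ("Python 3.6.8", "Python 3.8.0", "Python 3.10.12"):
    print(sample, "->", "escalate" if needs_escalation(sample) else "ok")
```

Comparing tuples rather than strings matters here: as text, "3.10" sorts before "3.8", which would wrongly flag newer interpreters.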
Symptom: "UNREACHABLE" errors during ping test
Tier 1 Troubleshooting:
chmod 600 ~/.ssh/ansible_key
ssh -i ~/.ssh/ansible_key user@target
Symptom: "FAILED" status with permission errors
Tier 1 Actions:
Situation: Setting up Ansible on fresh Linux server for team use.
What would you do?
Common Mistake: Using root user for Ansible operations. Always use dedicated service account with appropriate sudo privileges.
Situation: Separate inventories needed for development, staging, and production.
Tier 1 Approach: Create separate inventory files (dev-inventory, staging-inventory, prod-inventory) and specify using -i flag.
Tier 2 Requirement: Production environment access requires approval and separate SSH keys.
Objective: Verify Ansible infrastructure is operational and ready for daily automation tasks.
Prerequisites: Access to Ansible control nodes and monitoring dashboards.
Expected Result: All systems responsive with no critical errors identified.
Validation: Run ansible all -m ping successfully against sample inventory groups.
Escalation: If more than 10% of managed nodes unreachable or control node resources exceed 80% utilization.
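The escalation rule above reduces to two threshold comparisons. The Python sketch below encodes it; the function name and the idea of passing a single utilization percentage are assumptions for illustration, since the procedure does not say how utilization is measured.

```python
UNREACHABLE_LIMIT = 0.10   # more than 10% of managed nodes unreachable
UTILIZATION_LIMIT = 80.0   # control node resources above 80%


def should_escalate(unreachable: int, total: int, utilization_pct: float) -> bool:
    """Apply the daily-check escalation thresholds."""
    if total <= 0:
        raise ValueError("total must be a positive host count")
    unreachable_ratio = unreachable / total
    return unreachable_ratio > UNREACHABLE_LIMIT or utilization_pct > UTILIZATION_LIMIT


print(should_escalate(unreachable=3, total=20, utilization_pct=40.0))
print(should_escalate(unreachable=2, total=20, utilization_pct=40.0))
```

Both limits are strict ("more than", "exceed"), so exactly 10% unreachable or exactly 80% utilization does not trigger escalation in this sketch.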
What would you do if a critical daily playbook failed overnight?
Tier 1 Actions: Basic log review, system connectivity checks, standard playbook re-execution.
Escalation Required: Playbook modification, credential issues, infrastructure problems affecting multiple systems.
Objective: Maintain accurate inventory and remove obsolete entries.
Common Mistake: Removing hosts that are temporarily offline for maintenance. Always verify decommission status before deletion.
Objective: Identify performance bottlenecks and optimization opportunities.
Tier 2 Responsibility: Performance analysis and optimization recommendations require deeper Ansible expertise.
Objective: Perform thorough maintenance to ensure long-term system reliability.
Escalation Trigger: Any maintenance activity that could impact production automation requires Tier 2/3 approval.
Scenario: You notice Ansible job queue times increasing during peak hours.
Analysis Steps:
Monthly Requirements:
What would you do if a new team member needs Ansible access?
Correct Answer: Follow established access provisioning procedures, ensure proper training completion, and verify role-appropriate permissions. Never grant administrative access as a starting point.
Tier 1 Immediate Actions:
Escalation Required: Infrastructure-wide automation failures, security incidents, or any situation requiring playbook modifications during incident response.
Monitor Ansible automation health, track performance metrics, and configure alerting to ensure reliable automation operations and proactive issue detection.
Configure callback plugins to export metrics to monitoring systems:
# ansible.cfg
[defaults]
callback_plugins = /path/to/callback/plugins
callbacks_enabled = timer, profile_tasks, prometheus
[callback_prometheus]
prometheus_gateway = http://pushgateway:9091
job_name = ansible_playbooks
Implement custom tasks within playbooks to report application-specific metrics:
- name: Report deployment metrics
  uri:
    url: "http://monitoring-api/metrics"
    method: POST
    body_format: json
    body:
      deployment_time: "{{ ansible_date_time.epoch }}"
      hosts_updated: "{{ ansible_play_hosts | length }}"
      # ansible_play_name holds the name of the currently running play
      playbook_name: "{{ ansible_play_name }}"
  delegate_to: localhost
  run_once: true
Configure log forwarding to centralized systems:
# rsyslog configuration for Ansible logs
$template AnsibleLogFormat,"%timestamp% %hostname% ansible: %msg%\n"
if $programname == 'ansible' then /var/log/ansible/ansible.log;AnsibleLogFormat
& stop
groups:
  - name: ansible.rules
    rules:
      - alert: AnsiblePlaybookFailureRate
        expr: rate(ansible_playbook_failures_total[5m]) > 0.1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High Ansible playbook failure rate"
      - alert: AnsibleControlNodeDown
        expr: up{job="ansible-control"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Ansible control node is unreachable"
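To make the AnsiblePlaybookFailureRate threshold concrete: rate() yields a per-second average increase of a counter, so 0.1 corresponds to roughly 30 failures over a 5-minute window. The Python sketch below shows that arithmetic. It is a simplification and not Prometheus's implementation: it ignores counter resets and the extrapolation Prometheus applies at window edges, and the sample values are invented.

```python
def per_second_rate(samples):
    """Average per-second increase between the first and last sample.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    """
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    if t_last <= t_first:
        raise ValueError("samples must span a positive time range")
    return (v_last - v_first) / (t_last - t_first)


THRESHOLD = 0.1  # from the alert expression above

# Invented counter readings: 45 new failures across 300 seconds.
window = [(0, 100), (150, 120), (300, 145)]
failure_rate = per_second_rate(window)
print(failure_rate, failure_rate > THRESHOLD)
```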
What would you do if playbook execution times suddenly increased by 50%?
Correct approach: Start with infrastructure metrics, then drill down to application-level timing data to isolate the bottleneck.
What would you do if seeing sporadic SSH connection failures across multiple hosts?
Ansible automation must maintain comprehensive audit trails to demonstrate compliance with organizational policies, regulatory requirements, and security standards. All automation activities require detailed logging for accountability, forensic analysis, and compliance reporting.
Configure log retention based on compliance requirements:
AWX_TASK_ENV['ANSIBLE_LOG_PATH'] = '/var/log/ansible/ansible.log'
LOGGING_AGGREGATOR_ENABLED = True
LOGGING_AGGREGATOR_HOST = 'siem.company.com'
LOGGING_AGGREGATOR_PORT = 514
LOGGING_AGGREGATOR_TYPE = 'syslog'
LOGGING_AGGREGATOR_PROTOCOL = 'tcp'
# Enable detailed activity stream
ACTIVITY_STREAM_ENABLED = True
ACTIVITY_STREAM_ENABLED_FOR_INVENTORY_SYNC = True
# Configure audit log forwarding
LOGGING = {
    'version': 1,
    'handlers': {
        'audit_file': {
            'class': 'logging.handlers.RotatingFileHandler',
            'filename': '/var/log/tower/audit.log',
            'maxBytes': 1024*1024*100,
            'backupCount': 10,
        }
    }
}
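The rotation settings above determine how much audit history stays on disk, which matters when sizing retention. The Python sketch below loads an equivalent handler with the standard library's logging.config.dictConfig and computes the worst-case disk usage. The logger name "audit" and the temporary log path are assumptions for illustration; a real deployment would keep its own paths and logger wiring.

```python
import logging
import logging.config
import os
import tempfile

MAX_BYTES = 1024 * 1024 * 100   # 100 MiB per file, as in the settings above
BACKUP_COUNT = 10               # rotated files kept beside the active file


def build_config(path: str) -> dict:
    """Return a dictConfig mapping with one rotating audit handler."""
    return {
        "version": 1,
        "disable_existing_loggers": False,
        "handlers": {
            "audit_file": {
                "class": "logging.handlers.RotatingFileHandler",
                "filename": path,
                "maxBytes": MAX_BYTES,
                "backupCount": BACKUP_COUNT,
            }
        },
        # Hypothetical logger: a handler only receives records once a
        # logger references it.
        "loggers": {"audit": {"handlers": ["audit_file"], "level": "INFO"}},
    }


def worst_case_disk_bytes(max_bytes: int, backup_count: int) -> int:
    """Active file plus all retained backups, each at the size limit."""
    return max_bytes * (backup_count + 1)


log_path = os.path.join(tempfile.mkdtemp(), "audit.log")
logging.config.dictConfig(build_config(log_path))
logging.getLogger("audit").info("job_id=42 action=launch user=operator")
print(worst_case_disk_bytes(MAX_BYTES, BACKUP_COUNT) // 1024**2, "MiB")
```

If compliance requires longer retention than this budget covers, forward the logs to the central aggregator rather than relying on local rotation alone.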
Scenario: Financial system configuration changes require documented approval and audit trail.
What would you do? A playbook needs to modify database configurations on production financial systems.
Correct approach:
Scenario: Healthcare data processing systems require access logging and data handling audit trails.
Required controls:
Objective: Identify compliance violations and security anomalies in Ansible automation activities.
Prerequisites: Access to centralized logging system and audit analysis tools.
Steps:
Expected result: Daily compliance status report with identified violations and remediation actions.
Validation steps:
Mistake: Running playbooks without verbose logging enabled for compliance-sensitive operations.
Prevention: Configure job templates with mandatory verbose logging for regulated systems. Use callback plugins to ensure comprehensive audit trails.
Mistake: Automated changes without clear business justification or approval documentation.
Prevention: Implement workflow approvals with business justification requirements. Link automation jobs to change management tickets.
Escalate when:
When legal or regulatory investigations require audit evidence preservation:
Escalation trigger: Any request for audit evidence preservation must be escalated to Tier 3 and management within 2 hours of notification.
Ansible environments require comprehensive backup strategies covering playbooks, inventory files, configuration data, and execution history. This section focuses on operational backup and recovery procedures for maintaining business continuity.
Perform automated daily backups of all critical Ansible components to ensure recovery capability within defined RTO/RPO targets.
Complete backup archive containing all critical components, successfully transferred to secure storage with verified integrity.
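One way to produce an archive with verifiable integrity is to record a SHA-256 digest at creation time and recheck it after transfer. The following Python sketch uses only the standard library; the directory layout, file names, and function names are hypothetical, and transfer to secure storage is left out.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def create_backup(source_dir: Path, archive: Path) -> str:
    """Write a gzip-compressed tar of source_dir and return its checksum."""
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(source_dir, arcname=source_dir.name)
    return sha256_of(archive)


def verify_backup(archive: Path, expected_checksum: str) -> bool:
    """Check the digest, then confirm the archive opens and is non-empty."""
    if sha256_of(archive) != expected_checksum:
        return False
    with tarfile.open(archive, "r:gz") as tar:
        return len(tar.getnames()) > 0


# Demonstration against a throwaway project directory.
workdir = Path(tempfile.mkdtemp())
project = workdir / "ansible-project"
(project / "playbooks").mkdir(parents=True)
(project / "playbooks" / "site.yml").write_text("- hosts: all\n")

archive_path = workdir / "ansible-backup.tar.gz"
checksum = create_backup(project, archive_path)
print("verified:", verify_backup(archive_path, checksum))
```

Store the checksum separately from the archive so that corruption of one does not silently affect the other.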
When primary Ansible infrastructure fails, follow these restoration steps to minimize downtime and restore operational capability.
Situation: Primary Ansible control node experiences hardware failure during business hours. Critical automation jobs are scheduled to run within 2 hours.
What would you do?
Correct Response: Follow established disaster recovery runbook, prioritizing restoration of critical automation workflows first. Communicate regularly with stakeholders about recovery progress and expected completion time.
Learn how to integrate Ansible with external systems, APIs, and third-party tools to create comprehensive automation workflows that span multiple platforms and technologies.
The uri module enables Ansible to interact with REST APIs for system integration:
- name: Create user via API
  uri:
    url: "https://api.example.com/users"
    method: POST
    headers:
      Authorization: "Bearer {{ api_token }}"
      Content-Type: "application/json"
    body_format: json
    body:
      username: "{{ new_user }}"
      email: "{{ user_email }}"
    status_code: [201, 409]
  register: api_response
- name: Handle API response
  debug:
    msg: "User created with ID: {{ api_response.json.id }}"
  when: api_response.status == 201
Common API authentication patterns in Ansible:
# Token-based authentication
- name: Get API token
  uri:
    url: "https://api.example.com/auth/token"
    method: POST
    body_format: json
    body:
      username: "{{ vault_api_user }}"
      password: "{{ vault_api_pass }}"
  register: token_response
- name: Use token for subsequent calls
  uri:
    url: "https://api.example.com/data"
    headers:
      Authorization: "Bearer {{ token_response.json.access_token }}"
- name: Query application database
  mysql_query:
    login_host: "{{ db_host }}"
    login_user: "{{ db_user }}"
    login_password: "{{ db_password }}"
    login_db: "{{ app_database }}"
    query: "SELECT status FROM services WHERE name = %s"
    positional_args:
      - "{{ service_name }}"
  register: service_status
- name: Proceed based on database state
  include_tasks: deploy_service.yml
  # query_result holds one list per query; each row is a column: value dictionary
  when: service_status.query_result[0][0].status == 'ready'
- name: Update configuration table
  postgresql_query:
    db: "{{ postgres_db }}"
    login_host: "{{ postgres_host }}"
    login_user: "{{ postgres_user }}"
    login_password: "{{ postgres_password }}"
    query: |
      UPDATE config_settings
      SET value = %s, updated_at = NOW()
      WHERE key = %s
    positional_args:
      - "{{ new_config_value }}"
      - "{{ config_key }}"
- name: Launch EC2 instance and configure
  block:
    - name: Create EC2 instance
      amazon.aws.ec2_instance:
        name: "{{ instance_name }}"
        image_id: "{{ ami_id }}"
        instance_type: "{{ instance_type }}"
        security_group: "{{ security_group }}"
        vpc_subnet_id: "{{ subnet_id }}"
        state: present
      register: ec2_result
    - name: Wait for instance to be ready
      wait_for:
        host: "{{ ec2_result.instances[0].public_ip_address }}"
        port: 22
        timeout: 300
    - name: Add to inventory
      add_host:
        name: "{{ ec2_result.instances[0].public_ip_address }}"
        groups: web_servers
- name: Create Azure resource group and VM
  block:
    - name: Create resource group
      azure_rm_resourcegroup:
        name: "{{ resource_group }}"
        location: "{{ azure_region }}"
    - name: Create virtual machine
      azure_rm_virtualmachine:
        resource_group: "{{ resource_group }}"
        name: "{{ vm_name }}"
        vm_size: "{{ vm_size }}"
        admin_username: "{{ admin_user }}"
        ssh_password_enabled: false
        ssh_public_keys:
          - path: "/home/{{ admin_user }}/.ssh/authorized_keys"
            key_data: "{{ ssh_public_key }}"
- name: Query Prometheus for system metrics
  uri:
    url: "{{ prometheus_url }}/api/v1/query"
    # Form-encoded query parameters are sent in the request body, so use POST
    method: POST
    body_format: form-urlencoded
    body:
      query: "up{job='{{ service_name }}'}"
  register: prometheus_response
- name: Check service health
  set_fact:
    service_healthy: "{{ prometheus_response.json.data.result | length > 0 }}"
- name: Restart service if unhealthy
  systemd:
    name: "{{ service_name }}"
    state: restarted
  when: not service_healthy
- name: Create Grafana dashboard
  uri:
    url: "{{ grafana_url }}/api/dashboards/db"
    method: POST
    headers:
      Authorization: "Bearer {{ grafana_api_key }}"
      Content-Type: "application/json"
    body_format: json
    body:
      dashboard: "{{ dashboard_config }}"
      overwrite: true
  register: dashboard_result
- name: Clone and deploy from Git
  block:
    - name: Clone repository
      git:
        repo: "{{ git_repo_url }}"
        dest: "{{ deploy_path }}"
        version: "{{ git_branch | default('main') }}"
        force: yes
      register: git_result
    - name: Install dependencies if code changed
      command: "{{ install_command }}"
      args:
        chdir: "{{ deploy_path }}"
      when: git_result.changed
    - name: Restart application
      systemd:
        name: "{{ app_service }}"
        state: restarted
      when: git_result.changed
- name: Deploy to Kubernetes cluster
  kubernetes.core.k8s:
    state: present
    definition:
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: "{{ app_name }}"
        namespace: "{{ k8s_namespace }}"
      spec:
        replicas: "{{ replica_count }}"
        selector:
          matchLabels:
            app: "{{ app_name }}"
        template:
          metadata:
            labels:
              app: "{{ app_name }}"
          spec:
            containers:
              - name: "{{ app_name }}"
                image: "{{ container_image }}"
                ports:
                  - containerPort: "{{ app_port }}"
- name: Deploy Docker service
  docker_swarm_service:
    name: "{{ service_name }}"
    image: "{{ docker_image }}"
    replicas: "{{ service_replicas }}"
    networks:
      - "{{ docker_network }}"
    env:
      DATABASE_URL: "{{ database_connection }}"
    publish:
      - published_port: "{{ external_port }}"
        target_port: "{{ internal_port }}"
- name: Register service in Consul
  uri:
    url: "{{ consul_url }}/v1/agent/service/register"
    method: PUT
    body_format: json
    body:
      ID: "{{ service_id }}"
      Name: "{{ service_name }}"
      Address: "{{ ansible_default_ipv4.address }}"
      Port: "{{ service_port }}"
      Check:
        HTTP: "http://{{ ansible_default_ipv4.address }}:{{ service_port }}/health"
        Interval: "30s"
- name: Retrieve configuration from Consul KV
  uri:
    url: "{{ consul_url }}/v1/kv/{{ config_path }}"
    method: GET
  register: consul_config
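The Consul KV endpoint returns a JSON list in which each entry's Value field is base64-encoded, so the registered response needs decoding before use. The Python sketch below shows that step on a hand-built response body; the key name and stored value are invented for illustration, and the function name is hypothetical.

```python
import base64
import json


def decode_consul_kv(response_body: str) -> dict:
    """Map each Key to its decoded Value from a Consul KV JSON response."""
    decoded = {}
    for entry in json.loads(response_body):
        raw_value = entry.get("Value") or ""  # Value is null for empty keys
        decoded[entry["Key"]] = base64.b64decode(raw_value).decode("utf-8")
    return decoded


# Invented sample shaped like a KV read response.
sample_body = json.dumps([
    {
        "Key": "app/config/max_connections",
        "Value": base64.b64encode(b"250").decode("ascii"),
    },
    {"Key": "app/config/empty", "Value": None},
])

settings = decode_consul_kv(sample_body)
print(settings)
```

Inside a playbook the same decoding is typically done with the b64decode filter on the registered result.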
- name: Send deployment notification
  uri:
    url: "{{ slack_webhook_url }}"
    method: POST
    body_format: json
    body:
      channel: "{{ slack_channel }}"
      username: "Ansible Bot"
      # "environment" is a reserved playbook keyword, so use target_environment here as well
      text: "Deployment of {{ application_name }} to {{ target_environment }} completed successfully"
      attachments:
        - color: "good"
          fields:
            - title: "Version"
              value: "{{ deployment_version }}"
              short: true
            - title: "Environment"
              value: "{{ target_environment }}"
              short: true
- name: Send deployment report via email
  mail:
    # The mail module takes the SMTP server in its "host" parameter
    host: "{{ smtp_server }}"
    to: "{{ deployment_team_email }}"
    subject: "Deployment Report - {{ application_name }}"
    body: |
      Deployment Summary:
      Application: {{ application_name }}
      Environment: {{ target_environment }}
      Version: {{ deployment_version }}
      Status: {{ deployment_status }}
      Deployed services:
      {% for service in deployed_services %}
      - {{ service.name }}: {{ service.status }}
      {% endfor %}
Situation: You need to deploy an application that requires database updates, load balancer configuration, and monitoring setup.
What would you do?
Correct Answer: Option 3 - Use coordinated approach
Reasoning: Proper integration requires careful orchestration to ensure dependencies are met and systems remain consistent throughout the deployment process.
Situation: An API call in your playbook returns a 500 error during execution.
What would you do?
Correct Answer: Option 2 - Implement retry logic
Reasoning: Transient API failures are common; retry logic provides resilience while still failing appropriately for persistent issues.
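The retry approach can be sketched in Python as a bounded loop with exponential backoff. This illustrates the pattern only: TransientError, call_with_retries, and the flaky stand-in operation are hypothetical names. Within a playbook, the equivalent is the until, retries, and delay task keywords.

```python
import time


class TransientError(Exception):
    """Stand-in for a retryable failure such as an HTTP 500 response."""


def call_with_retries(operation, retries=3, base_delay=1.0, sleep=time.sleep):
    """Run operation, retrying on TransientError with exponential backoff.

    The final failure is re-raised so persistent problems still surface.
    """
    for attempt in range(1, retries + 1):
        try:
            return operation()
        except TransientError:
            if attempt == retries:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# Demonstration: fail twice, then succeed. Delays are recorded, not slept.
attempts = {"count": 0}
recorded_delays = []


def flaky_operation():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientError("simulated 500 response")
    return "ok"


result = call_with_retries(flaky_operation, sleep=recorded_delays.append)
print(result, attempts["count"], recorded_delays)
```

Capping the number of attempts is what keeps the task failing appropriately when the remote API is down for good.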
Mistake: Hardcoding API tokens or storing them in plain text
Solution: Always use Ansible Vault for sensitive credentials and implement token refresh logic for long-running operations
Mistake: Not handling partial failures in multi-system operations
Solution: Implement comprehensive error handling with rollback capabilities and clear escalation paths
Mistake: Not validating external system availability before proceeding
Solution: Always include connectivity and health checks before performing integration operations
- name: Validate integration endpoints
  uri:
    url: "{{ item.health_check_url }}"
    method: GET
    status_code: 200
  loop: "{{ integration_endpoints }}"
  register: health_checks
- name: Report integration status
  debug:
    msg: "All integrations healthy: {{ health_checks.results | selectattr('status', 'equalto', 200) | list | length == integration_endpoints | length }}"
After completing integration tasks, you should observe:
Escalate to Tier 2/3 when:
Several tools enhance Ansible development and operations workflows. Each serves specific purposes in the automation lifecycle.
Static analysis tool that checks playbooks for best practices and potential issues.
# Install ansible-lint
pip install ansible-lint
# Run against playbook
ansible-lint playbook.yml
# Run against role
ansible-lint roles/webserver/
# Skip specific rules
ansible-lint -x 301,302 playbook.yml
Common lint rules address:
Custom scripts for managing encrypted content in CI/CD pipelines.
#!/bin/bash
# vault-deploy.sh
export ANSIBLE_VAULT_PASSWORD_FILE=/secure/vault-pass
ansible-playbook -i inventory/production deploy.yml --vault-password-file "$ANSIBLE_VAULT_PASSWORD_FILE"
Tool for testing Ansible roles across multiple scenarios and platforms.
# Initialize molecule in role directory
molecule init scenario
# Run full test cycle
molecule test
# Create test instance
molecule create
# Run converge only
molecule converge
Dynamic inventory scripts pull host information from external sources.
#!/usr/bin/env python3
# aws_inventory.py
import boto3
import json


def get_ec2_inventory():
    ec2 = boto3.client('ec2')
    response = ec2.describe_instances()
    inventory = {'_meta': {'hostvars': {}}}
    for reservation in response['Reservations']:
        for instance in reservation['Instances']:
            if instance['State']['Name'] == 'running':
                # Process instance data
                pass
    return inventory


if __name__ == '__main__':
    print(json.dumps(get_ec2_inventory()))
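The "Process instance data" step is left open in the script above. One possible way to fill it in is sketched below in Python as a pure function that takes a describe_instances-style response, so it can be exercised without AWS credentials. Grouping by a "Role" tag and using the private IP address as the host name are assumptions made for illustration, not part of the original script.

```python
import json


def build_inventory(response: dict) -> dict:
    """Turn a describe_instances-style response into inventory JSON data."""
    inventory = {"_meta": {"hostvars": {}}, "all": {"hosts": []}}
    for reservation in response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            if instance["State"]["Name"] != "running":
                continue
            address = instance.get("PrivateIpAddress")
            if not address:
                continue
            inventory["all"]["hosts"].append(address)
            inventory["_meta"]["hostvars"][address] = {
                "ec2_instance_id": instance["InstanceId"],
            }
            for tag in instance.get("Tags", []):
                if tag["Key"] == "Role":  # hypothetical tagging convention
                    group = inventory.setdefault(tag["Value"], {"hosts": []})
                    group["hosts"].append(address)
    return inventory


# Invented response containing only the fields this sketch reads.
sample_response = {
    "Reservations": [
        {
            "Instances": [
                {
                    "InstanceId": "i-0aaa",
                    "State": {"Name": "running"},
                    "PrivateIpAddress": "10.0.1.10",
                    "Tags": [{"Key": "Role", "Value": "webservers"}],
                },
                {
                    "InstanceId": "i-0bbb",
                    "State": {"Name": "stopped"},
                    "PrivateIpAddress": "10.0.1.11",
                },
            ]
        }
    ]
}

ec2_inventory = build_inventory(sample_response)
print(json.dumps(ec2_inventory, indent=2))
```

For production use, the maintained amazon.aws.aws_ec2 inventory plugin is usually a better fit than a custom script.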
Scripts that standardize deployment processes across environments.
#!/bin/bash
# deploy-wrapper.sh
ENVIRONMENT=$1
PLAYBOOK=$2
EXTRA_VARS=$3
if [[ -z "$ENVIRONMENT" || -z "$PLAYBOOK" ]]; then
    echo "Usage: $0 <environment> <playbook> [extra-vars]"
    exit 1
fi
# Validate environment
case $ENVIRONMENT in
    dev|staging|production)
        echo "Deploying to $ENVIRONMENT"
        ;;
    *)
        echo "Invalid environment: $ENVIRONMENT"
        exit 1
        ;;
esac
# Set environment-specific variables
INVENTORY="inventory/$ENVIRONMENT"
VAULT_FILE="group_vars/$ENVIRONMENT/vault.yml"
# Execute playbook
# $EXTRA_VARS is intentionally unquoted so it can expand to several arguments
ansible-playbook -i "$INVENTORY" "$PLAYBOOK" --vault-password-file ~/.vault_pass $EXTRA_VARS
Jenkinsfile examples for Ansible automation in CI/CD pipelines.
pipeline {
    agent any
    stages {
        stage('Lint') {
            steps {
                sh 'ansible-lint playbooks/'
            }
        }
        stage('Test') {
            steps {
                sh 'molecule test'
            }
        }
        stage('Deploy') {
            when {
                branch 'main'
            }
            steps {
                withCredentials([file(credentialsId: 'vault-password', variable: 'VAULT_PASS')]) {
                    sh 'ansible-playbook -i inventory/production deploy.yml --vault-password-file $VAULT_PASS'
                }
            }
        }
    }
}
GitLab CI configuration for automated Ansible deployments.
# .gitlab-ci.yml
stages:
  - validate
  - test
  - deploy
variables:
  ANSIBLE_HOST_KEY_CHECKING: "False"
validate:
  stage: validate
  script:
    - ansible-lint playbooks/
    - ansible-playbook --syntax-check playbooks/site.yml
test:
  stage: test
  script:
    - molecule test
  only:
    - merge_requests
deploy_staging:
  stage: deploy
  script:
    - ansible-playbook -i inventory/staging deploy.yml
  only:
    - develop
deploy_production:
  stage: deploy
  script:
    - ansible-playbook -i inventory/production deploy.yml
  when: manual
  only:
    - main
Custom callback plugins for enhanced logging and monitoring.
# callback_plugins/custom_logger.py
from ansible.plugins.callback import CallbackBase
import json
import requests


class CallbackModule(CallbackBase):
    CALLBACK_VERSION = 2.0
    CALLBACK_TYPE = 'aggregate'
    CALLBACK_NAME = 'custom_logger'

    def v2_playbook_on_stats(self, stats):
        # Send completion stats to monitoring system
        data = {
            'hosts': list(stats.processed.keys()),
            'ok': stats.ok,
            'failures': stats.failures,
            'unreachable': stats.dark
        }
        # Post to monitoring endpoint; a reporting failure must not break the run
        try:
            requests.post('http://monitoring.example.com/ansible', json=data, timeout=10)
        except requests.RequestException as exc:
            self._display.warning(f"custom_logger: could not send stats: {exc}")
Scripts for parsing and analyzing Ansible execution logs.
#!/usr/bin/env python3
# analyze_logs.py
import re
import sys
from collections import defaultdict


def parse_ansible_log(log_file):
    stats = defaultdict(int)
    failed_tasks = []
    with open(log_file, 'r') as f:
        for line in f:
            if 'TASK [' in line:
                stats['tasks'] += 1
            elif 'fatal:' in line:
                stats['failures'] += 1
                failed_tasks.append(line.strip())
            elif 'ok:' in line:
                stats['success'] += 1
    return stats, failed_tasks


if __name__ == '__main__':
    stats, failures = parse_ansible_log(sys.argv[1])
    print(f"Task Statistics: {dict(stats)}")
    if failures:
        print("Failed Tasks:")
        for failure in failures:
            print(f" {failure}")
Maintain all automation tools and scripts in version control with proper branching strategies.
All automation scripts should include comprehensive error handling and logging mechanisms.
#!/bin/bash
# Error handling example
set -euo pipefail
log() {
    echo "[$(date +'%Y-%m-%d %H:%M:%S')] $1" >&2
}
cleanup() {
    log "Cleaning up temporary files"
    rm -f /tmp/ansible-$$.*
}
trap cleanup EXIT
log "Starting automation process"
# Automation logic here
Ansible automation changes require structured change management to prevent service disruptions and ensure rollback capabilities. This section covers change control processes, version management strategies, and approval workflows specific to Ansible deployments.
Standard Changes:
Normal Changes:
Emergency Changes:
Documentation Requirements:
Technical Validation:
Repository Structure:
ansible-infrastructure/
├── environments/
│ ├── production/
│ ├── staging/
│ └── development/
├── roles/
├── playbooks/
├── inventories/
└── CHANGELOG.md
Branching Strategy:
Tagging Convention:
Tier 1 Responsibilities:
Requires Escalation to Tier 2:
Tier 2/3 Responsibilities:
Objective: Execute approved Ansible changes while maintaining system stability and enabling rapid rollback if needed.
Prerequisites:
Implementation Steps:
Expected Result: Successful playbook execution with all tasks completed and validation checks passed.
Validation Steps:
Rollback Triggers:
Rollback Methods:
Rollback Decision Authority:
Required Documentation:
Post-Implementation Review:
Scenario: A critical security vulnerability requires immediate patching across 200 web servers. The security team has provided an Ansible playbook, but it hasn't been tested in your environment.
What would you do as Tier 1 support?
Correct Answer: Option 3 - Escalate to Tier 2 for emergency change approval.
Reasoning: Emergency changes still require proper authorization and risk assessment. Tier 1 should not execute untested playbooks on production systems, even during emergencies. Tier 2 can expedite the approval process while ensuring proper safeguards.
Common Mistakes:
The RACI (Responsible, Accountable, Consulted, Informed) matrix defines clear ownership and communication paths for Ansible-related activities across support tiers and organizational roles.
Tier 1 must escalate immediately when encountering:
Tier 2 must escalate to Tier 3 when:
Tier 3 must escalate to management/vendor when:
All escalations must include:
Ansible has inherent performance constraints that impact large-scale deployments:
Tier 1 Action: Monitor playbook execution times and escalate if runs exceed expected baselines by 50% or more.
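The 50% rule above is a relative comparison against a known baseline. A minimal Python sketch follows; the function name is hypothetical, and it assumes the baseline duration for the playbook is already recorded somewhere.

```python
def exceeds_baseline(duration_s: float, baseline_s: float, threshold: float = 0.50) -> bool:
    """True when a run is slower than baseline by the threshold or more."""
    if baseline_s <= 0:
        raise ValueError("baseline must be positive")
    return (duration_s - baseline_s) / baseline_s >= threshold


# A playbook that normally takes 10 minutes (600 s).
print(exceeds_baseline(900, 600))  # 50% slower
print(exceeds_baseline(840, 600))  # 40% slower
```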
Windows automation has specific limitations compared to Linux management:
Network automation presents unique challenges:
Ansible reaches practical limits at certain scales:
Ansible's security model has inherent constraints:
Tier 1 Escalation: Immediately escalate any suspected credential exposure or unauthorized privilege usage.
Common problematic modules and their issues:
Ansible's error handling has several gaps:
Established patterns to mitigate known limitations:
Known problems in current Ansible versions:
Tier 1 Validation: Check Ansible version compatibility before troubleshooting module failures.
Escalate to Tier 2 when encountering:
Certain Ansible operations pose significant risk to production systems and require strict access controls. Understanding these restrictions prevents accidental damage and ensures proper escalation procedures.
Tier 1 support staff must NEVER perform the following actions:
The following operations always require Tier 2 or higher approval:
These systems require special authorization before any Ansible operations:
These modules require senior engineer approval:
shell
command (when not using creates/removes parameters)
raw
script
mount/unmount operations
user management with sudo privileges
cron job modifications
systemd service management for critical services
iptables or firewall modifications
package removal operations
In critical situations requiring immediate action:
Before any Ansible operation, verify:
Immediately escalate when:
A customer requests immediate deployment of a security patch to all web servers using Ansible. The patch requires restarting the web service. What would you do?
Correct Response: Escalate to Tier 2. This involves production systems, service restarts, and security implications requiring senior approval and proper change management procedures.
Common Mistake: Running the playbook in development first to "test it." Even testing security patches requires proper authorization and may expose sensitive information.
Properly decommissioning Ansible components ensures security, compliance, and resource optimization while maintaining operational continuity for remaining systems.
Tier 1 Actions:
Tier 2/3 Escalation Required:
Objective: Safely remove a managed host from Ansible control
Prerequisites: Confirmation that host is no longer needed, backup verification complete
Steps:
Validation: Verify host no longer appears in ansible-inventory output and cannot be reached by test playbooks
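The inventory half of this validation can be automated by inspecting the JSON printed by ansible-inventory --list. The Python sketch below works on a captured JSON string rather than calling the command, so it runs anywhere; the sample data and function name are illustrative.

```python
import json


def hosts_in_inventory(inventory_json: str) -> set:
    """Collect every host name found in ansible-inventory --list output."""
    data = json.loads(inventory_json)
    hosts = set(data.get("_meta", {}).get("hostvars", {}))
    for name, group in data.items():
        if name != "_meta" and isinstance(group, dict):
            hosts.update(group.get("hosts", []))
    return hosts


# Invented output captured after the host was removed.
captured = json.dumps({
    "_meta": {"hostvars": {"web1.example.com": {}}},
    "all": {"children": ["ungrouped", "webservers"]},
    "webservers": {"hosts": ["web1.example.com"]},
})

remaining = hosts_in_inventory(captured)
print("web2.example.com" in remaining)
```

Checking both _meta.hostvars and each group's hosts list catches entries that linger in only one of the two places.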
# Create decommission playbook
- name: Decommission hosts
  hosts: decommission_group
  tasks:
    - name: Stop managed services
      service:
        name: "{{ item }}"
        state: stopped
      loop: "{{ services_to_stop }}"
    - name: Remove automation user
      user:
        name: ansible
        state: absent
        remove: yes
    - name: Clear authorized keys
      file:
        path: /home/ansible/.ssh/authorized_keys
        state: absent
Tier 2/3 Responsibility:
Prerequisites: All critical workloads migrated, stakeholder approval obtained
Steps:
Security Requirements:
Tier 1 Actions:
Tier 2 Escalation: License reallocation and contract modifications
Archive critical operational knowledge:
Situation: Security incident requires immediate Ansible controller shutdown
What would you do?
Escalation Trigger: Any security-related decommissioning requires immediate Tier 2/3 involvement
Situation: Migrating from older Ansible version to new platform
Tier 1 Actions:
Q: What is the difference between Ansible and other automation tools like Puppet or Chef?
A: Ansible is agentless and uses SSH for communication, making it simpler to deploy. It uses YAML for configuration (playbooks) rather than custom languages, and follows a push-based model rather than pull-based like Puppet or Chef.
Q: Do I need to install anything on target servers?
A: No. Ansible only requires SSH access and Python on target systems. Most Linux distributions include Python by default.
Q: Can Ansible manage Windows servers?
A: Yes. Ansible uses WinRM (Windows Remote Management) instead of SSH for Windows targets and includes Windows-specific modules.
Q: Why did my playbook fail with "unreachable" errors?
A: Common causes include SSH connectivity issues, incorrect inventory hostnames/IPs, authentication failures, or target systems being offline. Check network connectivity and SSH key authentication first.
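A quick TCP check helps separate network problems from authentication problems when a host is reported unreachable. The Python sketch below tests only whether a port accepts connections; it says nothing about SSH keys or user accounts, and the function name is hypothetical.

```python
import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, port_open("web1.example.com", 22) returning False points to a firewall, routing, or host-down issue, while True with a failing playbook suggests checking keys and the remote user next.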
Q: How do I run only specific tasks in a playbook?
A: Use tags. Add tags to tasks and run with --tags tagname or skip tasks with --skip-tags tagname.
Q: What does "changed=0" mean in task output?
A: The task ran successfully but made no changes because the system was already in the desired state (idempotency).
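The same idea can be shown outside Ansible. This Python sketch mimics how an idempotent operation reports "changed": it modifies state only when the desired line is missing. It is an analogy for the concept, not how any Ansible module is implemented, and the names are invented.

```python
def ensure_line(lines: list, wanted: str) -> bool:
    """Append wanted if absent; return True only when a change was made."""
    if wanted in lines:
        return False
    lines.append(wanted)
    return True


config_lines = ["PermitRootLogin no"]
first_run = ensure_line(config_lines, "PasswordAuthentication no")
second_run = ensure_line(config_lines, "PasswordAuthentication no")
print(first_run, second_run, config_lines)
```

The second call finds the desired state already in place, which is the situation a changed=0 summary describes.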
Q: Can I run Ansible playbooks in parallel?
A: Yes. Use the --forks parameter to control parallelism, or set serial in playbooks to control batch sizes.
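To illustrate how serial splits a play into batches, here is a short Python sketch. It covers only a fixed integer batch size; Ansible's serial keyword also accepts percentages and lists of sizes, which this sketch does not model.

```python
def serial_batches(hosts: list, serial: int) -> list:
    """Split hosts into consecutive batches of at most serial hosts."""
    if serial < 1:
        raise ValueError("serial must be at least 1")
    return [hosts[i:i + serial] for i in range(0, len(hosts), serial)]


web_hosts = ["web1", "web2", "web3", "web4", "web5"]
for number, batch in enumerate(serial_batches(web_hosts, 2), start=1):
    print(f"batch {number}: {batch}")
```

Each batch runs the whole play before the next batch starts, while forks limits how many hosts are contacted in parallel at any moment.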
Q: How do I organize hosts into groups?
A: Create groups in inventory files using bracket notation [groupname] and list hosts underneath. Hosts can belong to multiple groups.
Q: Where should I store sensitive data like passwords?
A: Use Ansible Vault to encrypt sensitive variables. Never store passwords in plain text in playbooks or inventory files.
Q: How do I pass variables to playbooks at runtime?
A: Use --extra-vars "key=value" or -e @filename.yml to load variables from files.
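When a script launches playbooks, passing extra variables as a JSON string avoids quoting problems and preserves types such as integers and booleans, which the key=value form passes as strings. The Python sketch below only builds the argument list and does not run anything; the playbook name, inventory path, and variables are placeholders.

```python
import json
import shlex


def build_playbook_command(playbook: str, inventory: str, extra_vars: dict) -> list:
    """Return an argv list for ansible-playbook with JSON-encoded extra vars."""
    return [
        "ansible-playbook",
        "-i", inventory,
        playbook,
        "--extra-vars", json.dumps(extra_vars, sort_keys=True),
    ]


command = build_playbook_command(
    "deploy.yml",
    "inventory/staging",
    {"app_version": "1.4.2", "replicas": 3},
)
print(shlex.join(command))
```

Passing the list to subprocess.run without shell=True keeps the JSON argument intact as a single argument.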
Q: My playbook works sometimes but fails other times. Why?
A: This often indicates race conditions, network timeouts, or dependencies on external services. Add appropriate error handling, retries, and wait conditions.
Q: How do I debug failed tasks?
A: Use -vvv for verbose output, add debugger: on_failed to tasks, or use the debug module to print variable values.
Q: Tasks fail with permission errors. What should I check?
A: Verify the SSH user has necessary permissions, consider using become: yes for privilege escalation, and check file/directory ownership and permissions.
Q: My playbooks run slowly. How can I improve performance?
A: Increase fork count, use pipelining=True in ansible.cfg, minimize fact gathering with gather_facts: no when not needed, and use async tasks for long-running operations.
Q: Should I use roles or playbooks?
A: Use roles for reusable, modular automation (like installing Apache). Use playbooks to orchestrate multiple roles and define specific workflows.
Q: How often should I run playbooks?
A: Depends on requirements. Configuration management playbooks can run frequently due to idempotency. Application deployment playbooks typically run on-demand or via CI/CD triggers.
Q: Is it safe to store SSH keys for Ansible?
A: Use dedicated service accounts with minimal required permissions. Consider SSH agent forwarding or vault-managed credentials rather than storing private keys on disk.
Q: How do I rotate passwords managed by Ansible?
A: Update encrypted variables in Ansible Vault, then run playbooks to apply changes. Coordinate with applications that use those credentials.
When to escalate to Tier 2/3:
What Tier 1 can handle:
Ad-hoc Command: A single Ansible command executed directly from the command line without using a playbook, typically for quick tasks or testing.
Ansible Control Node: The machine where Ansible is installed and from which playbooks, ad-hoc commands, and other Ansible operations are executed.
Ansible Galaxy: A community hub for sharing and downloading Ansible roles, collections, and other content created by the Ansible community.
Ansible Vault: A feature that allows encryption of sensitive data such as passwords, keys, and other secrets within Ansible files.
Collection: A distribution format for Ansible content that includes modules, plugins, roles, and playbooks packaged together with metadata.
Facts: System information automatically gathered by Ansible about managed nodes, including hardware details, network configuration, and operating system information.
Handler: A special type of task that runs only when notified by other tasks, typically used for service restarts or configuration reloads.
Idempotency: The property that allows Ansible tasks to be run multiple times without changing the result beyond the initial application.
Inventory: A list of managed nodes (hosts) that Ansible can connect to and manage, along with variables and grouping information.
Managed Node: A remote system or host that is managed by Ansible, also referred to as a target host.
Module: A reusable, standalone script that performs a specific task on managed nodes, such as installing packages or managing files.
Play: An ordered list of tasks executed against a specific set of hosts defined in the inventory.
Playbook: A YAML file containing one or more plays that define the automation workflow and tasks to be executed.
Role: A way of organizing tasks, handlers, variables, templates, and other files in a standardized directory structure for reusability and sharing.
Task: A single unit of work in Ansible that calls a module with specific parameters to perform an action on managed nodes.
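To show how several of the terms above fit together, the following sketch is one playbook containing one play, two tasks that call modules, and one handler. The webservers group and the files/nginx.conf path are assumptions made for illustration only.

```yaml
# site.yml (hypothetical file name): a playbook containing a single play
- name: Configure web servers          # the play
  hosts: webservers                    # hypothetical inventory group
  become: true
  tasks:
    - name: Install nginx              # a task calling the package module
      ansible.builtin.package:
        name: nginx
        state: present                 # idempotent: no change if already installed

    - name: Deploy nginx configuration
      ansible.builtin.copy:
        src: files/nginx.conf          # hypothetical path relative to the playbook
        dest: /etc/nginx/nginx.conf
        mode: "0644"
      notify: Restart nginx            # notifies the handler only when the file changes

  handlers:
    - name: Restart nginx              # runs after the tasks, and only if notified
      ansible.builtin.service:
        name: nginx
        state: restarted
```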
Become: Ansible's privilege escalation system that allows tasks to run with elevated permissions (sudo, su, etc.).
Connection Plugin: Components that handle communication between the control node and managed nodes using protocols like SSH, WinRM, or local connections.
Delegation: The ability to run a task on a different host than the one currently being processed in the play.
Fork: One of the parallel worker processes Ansible uses to communicate with managed nodes. The forks setting (default 5) controls how many run simultaneously.
Gather Facts: The automatic collection of system information from managed nodes at the beginning of play execution.
Serial: A play-level keyword that controls how many hosts are processed in each batch during play execution, commonly used for rolling updates.
Strategy: The method Ansible uses to execute tasks across multiple hosts, such as linear (default) or free strategy.
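As an illustration of the execution terms above, the sketch below combines become, delegation, serial, and strategy in a rolling restart. The host names, the lb-drain script, and the myapp service are hypothetical placeholders.

```yaml
# Forks are set outside the playbook, for example with
#   ansible-playbook rolling_restart.yml --forks 10
# or with the forks option in ansible.cfg.
- name: Rolling restart of the application tier
  hosts: app_servers                   # hypothetical inventory group
  serial: 2                            # process two hosts per batch
  strategy: linear                     # the default, shown here for clarity
  become: true                         # privilege escalation (sudo by default)
  tasks:
    - name: Drain the current host from the load balancer
      ansible.builtin.command:
        cmd: "/usr/local/bin/lb-drain {{ inventory_hostname }}"   # hypothetical script
      delegate_to: lb01.example.com    # runs on the load balancer, not the current host

    - name: Restart the application service
      ansible.builtin.service:
        name: myapp                    # hypothetical service name
        state: restarted
```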
Group Variables: Variables that apply to all hosts within a specific inventory group, typically defined in group_vars directories.
Host Variables: Variables that apply to individual hosts, typically defined in host_vars directories or directly in inventory files.
Jinja2: The templating engine used by Ansible for variable substitution and conditional logic in templates and playbooks.
Magic Variables: Special variables automatically provided by Ansible that contain information about the current execution context.
Register: A task keyword that captures the result of a task (such as return code, stdout, and changed status) and stores it in a variable for later use.
Template: A file that contains variables and expressions that get processed by the Jinja2 templating engine to generate final configuration files.
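The variable terms above can be seen working together in a short play. This is a sketch under assumed names: app_port, templates/app.conf.j2, and /etc/myapp/app.conf are hypothetical, and app_port would normally live in a group_vars file rather than in the play.

```yaml
- name: Register a result and render a template
  hosts: webservers                    # hypothetical inventory group
  vars:
    app_port: 8080                     # hypothetical; usually defined in group_vars/webservers.yml
  tasks:
    - name: Capture the kernel release
      ansible.builtin.command: uname -r
      register: kernel_result          # stores rc, stdout, stderr, and more
      changed_when: false              # a read-only command never changes the host

    - name: Show the registered value using a Jinja2 expression
      ansible.builtin.debug:
        # inventory_hostname is a magic variable supplied by Ansible
        msg: "{{ inventory_hostname }} runs kernel {{ kernel_result.stdout }}"

    - name: Render the application configuration
      ansible.builtin.template:
        # The hypothetical template could contain a line such as:
        #   listen_port = {{ app_port }}
        src: templates/app.conf.j2
        dest: /etc/myapp/app.conf
        mode: "0644"
```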
ansible.cfg: The main configuration file that controls Ansible's behavior, including default settings and operational parameters.
Dynamic Inventory: Inventory information generated automatically from external sources like cloud providers or CMDBs rather than static files.
Inventory Plugin: Components that enable Ansible to pull inventory information from various sources and formats.
Static Inventory: Inventory information defined in static files, typically in INI or YAML format.
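For reference, a minimal static inventory in YAML format might look like the following. All host names, addresses, and variable values are placeholders, not taken from any real environment.

```yaml
# inventory.yml (hypothetical file name)
all:
  children:
    webservers:                        # an inventory group
      hosts:
        web01.example.com:
        web02.example.com:
          ansible_host: 192.0.2.12     # host variable defined inline
      vars:
        http_port: 80                  # group variable for every host in webservers
    db_servers:
      hosts:
        db01.example.com:
```

A file like this is passed to Ansible with the -i option, for example: ansible-playbook -i inventory.yml site.yml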
Callback Plugin: Components that respond to events during playbook execution, enabling custom logging, notifications, or integrations.
Conditional: Logic that determines whether a task should be executed based on variables, facts, or other conditions using 'when' statements.
Loop: A construct that allows a task to be executed multiple times with different values. The 'loop' keyword is the recommended alternative to the older 'with_items' syntax, which remains supported.
Lookup Plugin: Components that allow Ansible to access data from external sources during playbook execution.
Tag: Labels assigned to tasks, plays, or roles that allow selective execution of specific parts of a playbook.
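A single task can combine a conditional, a loop, and a tag, as in this sketch. The package names and the packages tag are arbitrary examples.

```yaml
- name: Install baseline packages
  hosts: all
  become: true
  tasks:
    - name: Install packages on Debian-family hosts
      ansible.builtin.apt:
        name: "{{ item }}"             # item holds the current loop value
        state: present
      loop:
        - curl
        - vim
      when: ansible_facts['os_family'] == "Debian"   # conditional based on a gathered fact
      tags:
        - packages                     # allows selective execution
```

To run only the tagged tasks, pass the tag on the command line, for example: ansible-playbook site.yml --tags packages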
Block: A logical grouping of tasks that can share keywords such as 'when' or 'become' and that supports error handling through rescue and always sections.
Failed When: A task keyword ('failed_when') that defines custom conditions for when a task should be considered failed.
Ignore Errors: A task keyword ('ignore_errors') that allows playbook execution to continue on a host even if the task fails.
Rescue: A section within a block that executes when tasks in the block fail, similar to a catch block in programming.
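The error handling terms above can be sketched in one play. The healthcheck and cleanup-cache scripts, and the assumption that a healthy result prints "OK", are hypothetical and stand in for whatever checks your environment provides.

```yaml
- name: Health check with error handling
  hosts: app_servers                   # hypothetical inventory group
  tasks:
    - name: Check application health and react to failure
      block:
        - name: Run the health check
          ansible.builtin.command: /usr/local/bin/healthcheck   # hypothetical script
          register: health
          changed_when: false
          # Assumption: the script prints "OK" when the application is healthy
          failed_when: "'OK' not in health.stdout"
      rescue:
        - name: Report the failure     # runs only if a task in the block fails
          ansible.builtin.debug:
            msg: "Health check failed on {{ inventory_hostname }}"
      always:
        - name: Record completion      # runs whether or not the block failed
          ansible.builtin.debug:
            msg: "Health check finished on {{ inventory_hostname }}"

    - name: Optional cache cleanup
      ansible.builtin.command: /usr/local/bin/cleanup-cache     # hypothetical script
      ignore_errors: true              # a failure here does not stop the play
```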