Cross-Functional Architecture And Tools For Cloud-Based Operating Models
The Agile Cloud Manager is intended to be used inside a pipeline tool.
You can continue to use your current pipeline tool because any pipeline tool can potentially be used as the graphical user interface for the Agile Cloud Manager.
Reading the logs produced by the Agile Cloud Manager during runtime will become one of the most important tasks that your engineers perform. Your operations teams will use the logs to diagnose the day-to-day performance of the pipelines. Your platform engineering teams will also use the logs to develop the templates that will compose your appliances.
The Agile Cloud Manager’s logs consolidate output from many different underlying tools and organize that underlying output within the structure of the workflow that the Agile Cloud Manager creates to execute any given CLI Command.
Using the logs is a 4-step process which we will summarize in this article as follows:
Logs are written both to your pipeline tool’s logs and also to specific log file locations within each agent’s file structure.
Your pipeline tool’s graphical user interface will organize logs by job and then by step within each job.
The Agile Cloud Manager will create new logs each time one of the 12 CLI commands is run.
Therefore, you can locate the Agile Cloud Manager’s logs by navigating in your pipeline tool’s graphical user interface first to each relevant job and then to each step that calls one of the Agile Cloud Manager’s CLI commands within each relevant job.
Alternatively, the Agile Cloud Manager also writes two different log files to each agent’s directory structure. In Windows agents, logs are written to $USER_HOME\\acm\\logs\\. In Linux agents, logs are written to /var/log/acm/. The two native log files created within these log directories by the Agile Cloud Manager are:
log-verbose.log contains exactly the same information that is printed to your pipeline tool’s job/step logs. However, the information in log-verbose.log should always be printed in the precise order in which each event occurred, while the version of the same information in your pipeline tool’s logs will sometimes be out of chronological order.
log-acm-summary.log contains only the highest-level summary for easier reading.
A new copy of log-verbose.log and log-acm-summary.log will be created every time you run an Agile Cloud Manager CLI command. Each time, the previous version of each log file is backed up using the <date-time-stamp> value from the log being backed up, so the logs directory on the agent might contain numerous log files with names in the log-verbose<date-time-stamp>.log and log-acm-summary<date-time-stamp>.log formats.
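As a rough illustration of working with these files, the following Python sketch lists the current log and its backups in a log directory, newest first. The helper name list_acm_logs is hypothetical and is not part of the Agile Cloud Manager. Because the exact format of <date-time-stamp> is not described here, the sketch assumes that file modification time is a good enough ordering and does not try to parse the file names.

```python
from pathlib import Path


def list_acm_logs(log_dir, stem="log-verbose"):
    """Return the current log and its backups, newest first.

    Matches both "<stem>.log" and "<stem><date-time-stamp>.log".
    Ordering uses modification time, which is an assumption made
    because the timestamp format in the file names is not known.
    """
    directory = Path(log_dir)
    matches = directory.glob(f"{stem}*.log")
    return sorted(matches, key=lambda path: path.stat().st_mtime, reverse=True)


if __name__ == "__main__":
    # /var/log/acm/ is the Linux location given in this article.
    for log_path in list_acm_logs("/var/log/acm/"):
        print(log_path.name)
```

Passing stem="log-acm-summary" would list the summary logs in the same way.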
To access these logs in the agent directory structure, your provisioning scripts for each agent must map the locations of these log files in each agent to a network file share that will persist after each agent is destroyed when each job is completed.
Your pipeline tool’s graphical user interface will be a sufficient source of logs in most cases because the slight differences in chronological ordering are usually not an impediment to the diagnostic use of logs. But the Agile Cloud Manager’s own native log files are available if you need them.
The first thing you will notice in the logs is that the first block at the left of each line indicates which tool produced the line. For example, if Agile Cloud Manager workflow code is being summarized in the line, then the line will begin with [ acm ]. By contrast, if a shell command is being summarized in the line, then the line will begin with [ shell ]. If a Terraform command is being summarized in the line, the line will begin with [ terraform ], and so on.
Some of the values that can appear at the start of each line are:
[ acm ] Agile Cloud Manager
[ shell ] Shell
[ terraform ] Terraform
[ packer ] Packer
[ arm ] ARM
[ cf ] CloudFormation
[ az-cli ] Azure CLI
[ … ] Others can be specified explicitly.
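The prefixes listed above make it straightforward to filter or group log lines by tool. The following is a minimal Python sketch, assuming every prefixed line has the shape "[ tool ] message" seen in the examples in this article. The function names and the sample messages for tools other than acm are made up for illustration.

```python
import re
from collections import Counter

# Assumed line shape, based on the examples in this article: "[ tool ] message".
PREFIX_PATTERN = re.compile(r"^\[\s*(?P<tool>[^\]]+?)\s*\]\s?(?P<message>.*)$")


def split_prefix(line):
    """Return (tool, message), or (None, line) when no prefix is found."""
    text = line.rstrip("\n")
    match = PREFIX_PATTERN.match(text)
    if match is None:
        return None, text
    return match.group("tool"), match.group("message")


def count_lines_by_tool(lines):
    """Count how many log lines each tool contributed."""
    return Counter(split_prefix(line)[0] for line in lines)


sample_lines = [
    "[ acm ] command is: on",
    "[ acm ] name: tfbackend",
    "[ terraform ] (hypothetical terraform output line)",
    "[ shell ] (hypothetical shell output line)",
]
for tool, count in count_lines_by_tool(sample_lines).items():
    print(tool, count)
```

Filtering on the tool name is one way to isolate, for example, only the [ acm ] workflow lines before reading the output of an underlying tool.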
The logs are structured by breakpoints that summarize the progress of the workflows that get created by the Agile Cloud Manager when you run any of the CLI commands.
To identify the point in a workflow where something of interest occurred, you must therefore understand the structure of the workflow whose breakpoints provide structure to the Agile Cloud Manager’s logs.
There are 12 possible types of steps in any workflow that the Agile Cloud Manager will create to execute one of the CLI commands on an appliance configuration that you provide. These 12 types of possible steps are listed as follows:
Start of appliance run
Start of each system within appliance
Start of foundation (if exists) within each system
End of foundation (if exists)
Start of ServiceTypes
Start of each ServiceType
Start of each instance of each ServiceType
End of each instance of each ServiceType
End of each ServiceType
End of all ServiceTypes
End of each system within appliance
End of appliance run
Examining the list of possible types of steps illustrates several aspects of the logs, including:
Good design will result in each appliance being composed of relatively small numbers of system templates, service types, and service instances.
The number of steps in any given log file will directly correspond to the number of objects defined in each of the system template files that are referenced in the acm.yaml file.
The number of steps in the current workflow will be printed into the logs.
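To make the relationship between objects and steps concrete, the following Python sketch estimates the number of breakpoints for an appliance. It assumes that each of the 12 step types occurs exactly once per object it names, which is consistent with the demo Changes Manifest shown later in this article (one system, no foundation, one service type, two instances, 12 changes) but is not confirmed as a general rule. The input structure and the function name are hypothetical and do not reflect the actual acm.yaml schema.

```python
def count_workflow_steps(systems):
    """Estimate the number of workflow breakpoints for an appliance.

    `systems` is a hypothetical structure: a list of dicts, each with
    "has_foundation" (bool) and "service_types" (a dict mapping each
    service type name to a list of instance names).
    """
    total = 2  # start and end of the appliance run
    for system in systems:
        total += 2  # start and end of the system
        if system["has_foundation"]:
            total += 2  # start and end of the foundation
        total += 2  # start and end of the ServiceTypes section
        for instances in system["service_types"].values():
            total += 2  # start and end of the service type
            total += 2 * len(instances)  # start and end of each instance
    return total


# Shape of the demo used in this article: one system named tfbackend,
# no foundation, one service type with two instances.
demo = [
    {
        "has_foundation": False,
        "service_types": {"tfBackend": ["adminAccounts", "pipelineAgents"]},
    }
]
print(count_workflow_steps(demo))  # 12
```

The count grows with every system, service type, and instance, which is one reason small appliances produce shorter and more readable logs.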
The following summaries will be printed to the logs at the workflow breakpoints where each of the workflow steps has just been completed:
A human readable summary is given at each point in the workflow. This human-readable summary contains the same information given in the Changes Manifest and in the Change Taxonomy, but is very clearly written in plain English.
One example of a human-readable changes summary from a command run by one of our demos is:
[ acm ] APPLIANBCE LEVEL:
[ acm ] command is: on
[ acm ] overallStatus changed from NOT Started to In Process
[ acm ] currentStep did NOT change since the last step and is: 0 out of 1 steps.
[ acm ] SYSTEMS: Each system in the appliance will be summarized one at a time as follows:
[ acm ] tfbackend SUMMARY LEVEL:
[ acm ] name: tfbackend
[ acm ] system summary status did NOT change and is: NOT Started
[ acm ] system summary currentStep did NOT change and is: 0 out of 1 steps.
[ acm ] SERVICES SUMMARY LEVEL:
[ acm ] all services summary status did NOT change and is: NOT Started
[ acm ] all services summary currentStep did NOT change and is: 0 out of 1 steps.
[ acm ] Each type of service is summarized as follows:
[ acm ] tfBackend summary is as follows:
[ acm ] tfBackend summary status did NOT change and is: NOT Started
[ acm ] tfBackend summary currentStep did NOT change and is: 0 out of a total 2 steps.
[ acm ] INSTANCES OF tfBackend SERVICE TYPE ARE:
[ acm ] adminAccounts instance of tfBackend
[ acm ] adminAccounts summary status did NOT change and is: NOT Started
[ acm ] adminAccounts summary currentStep did NOT change and is: 0 out of a total 1 steps.
[ acm ] pipelineAgents instance of tfBackend
[ acm ] pipelineAgents summary status did NOT change and is: NOT Started
[ acm ] pipelineAgents summary currentStep did NOT change and is: 0 out of a total 1 steps.
[ acm ] ////////////////////////////////////////////////////////////////////
[ acm ] CHANGE SUMMARY:
[ acm ] ... overallStatus changed from NOT Started to In Process
If you examine this human-only-readable summary, you will notice several things:
A Changes Manifest will also be reprinted into the log file at the end of every step in the workflow. This Changes Manifest will list each step including the changes that will be made in each step in the entire workflow, along with the current status of each of the changes to be made in each step.
Each of the changes in each step can do one of only two possible things:
In addition, the status of each change will be reported as either True or False each time the Changes Manifest is printed into the logs, so that you can see the progression of changes being made as the Agile Cloud Manager steps through the workflow that it creates to execute a CLI command.
The intent is for the Changes Manifest to be both machine-readable and human-readable.
One example of a Changes Manifest from a command run by one of our demos is:
[ acm ] The current status of the 12 changes being made in this run is:
[ acm ] {'changeIndex': 1, 'changeType': 'Start of appliance run', 'key': 'applianceStart', 'changes': [{'affectedUnit': ' appliance', 'Status': 'To In Process', 'Step': 'Same', 'changeCompleted': False}]}
[ acm ] {'changeIndex': 2, 'changeType': 'Start of a system', 'key': 'appliance/system:tfbackend', 'changes': [{'affectedUnit': ' appliance', 'Status': 'same', 'Step': '+1', 'changeCompleted': False}, {'affectedUnit': ' appliance/system:tfbackend', 'Status': 'To In Process', 'Step': 'Same', 'changeCompleted': False}]}
[ acm ] {'changeIndex': 3, 'changeType': 'Start of a services section', 'key': ' appliance/system:tfbackend', 'changes': [{'affectedUnit': ' appliance/system:tfbackend', 'Status': 'same', 'Step': '+1', 'changeCompleted': False}, {'affectedUnit': ' appliance/system:tfbackend/serviceTypes', 'Status': 'To In Process', 'Step': 'same', 'changeCompleted': False}]}
[ acm ] {'changeIndex': 4, 'changeType': 'Start of a serviceType', 'key': ' appliance/system:tfbackend/serviceTypes', 'changes': [{'affectedUnit': ' appliance/system:tfbackend/serviceTypes', 'Status': 'same', 'Step': '+1', 'changeCompleted': False}, {'affectedUnit': ' appliance/system:tfbackend/serviceTypes/tfBackend', 'Status': 'To In Process', 'Step': 'same', 'changeCompleted': False}]}
[ acm ] {'changeIndex': 5, 'changeType': 'Start of an instance of a serviceType', 'key': ' appliance/system:tfbackend/serviceTypes/tfBackend', 'changes': [{'affectedUnit': ' appliance/system:tfbackend/serviceTypes/tfBackend', 'Status': 'same', 'Step': '+1', 'changeCompleted': False}, {'affectedUnit': ' appliance/system:tfbackend/serviceTypes/tfBackend/adminAccounts', 'Status': 'To In Process', 'Step': '+1', 'changeCompleted': False}]}
[ acm ] {'changeIndex': 6, 'changeType': 'End of an instance of a serviceType', 'key': ' appliance/system:tfbackend/serviceTypes/tfBackend/adminAccounts', 'changes': [{'affectedUnit': ' appliance/system:tfbackend/serviceTypes/tfBackend/adminAccounts', 'Status': 'To Completed', 'Step': 'same', 'changeCompleted': False}]}
[ acm ] {'changeIndex': 7, 'changeType': 'Start of an instance of a serviceType', 'key': 'appliance/system:tfbackend/serviceTypes/tfBackend', 'changes': [{'affectedUnit': 'appliance/system:tfbackend/serviceTypes/tfBackend', 'Status': 'same', 'Step': '+1', 'changeCompleted': False}, {'affectedUnit': 'appliance/system:tfbackend/serviceTypes/tfBackend/pipelineAgents', 'Status': 'To In Process', 'Step': '+1', 'changeCompleted': False}]}
[ acm ] {'changeIndex': 8, 'changeType': 'End of an instance of a serviceType', 'key': 'appliance/system:tfbackend/serviceTypes/tfBackend/pipelineAgents', 'changes': [{'affectedUnit': 'appliance/system:tfbackend/serviceTypes/tfBackend/pipelineAgents', 'Status': 'To Completed', 'Step': 'same', 'changeCompleted': False}]}
[ acm ] {'changeIndex': 9, 'changeType': 'End of a serviceType', 'key': 'appliance/system:tfbackend/serviceTypes/tfBackend', 'changes': [{'affectedUnit': 'appliance/system:tfbackend/serviceTypes/tfBackend', 'Status': 'To Completed', 'Step': 'same', 'changeCompleted': False}]}
[ acm ] {'changeIndex': 10, 'changeType': 'End of a services section', 'key': 'appliance/system:tfbackend/serviceTypes', 'changes': [{'affectedUnit': 'appliance/system:tfbackend/serviceTypes', 'Status': 'To Completed', 'Step': 'same', 'changeCompleted': False}]}
[ acm ] {'changeIndex': 11, 'changeType': 'End of a system', 'key': 'appliance/system:tfbackend', 'changes': [{'affectedUnit': 'appliance/system:tfbackend', 'Status': 'To Completed', 'Step': 'Same', 'changeCompleted': False}]}
[ acm ] {'changeIndex': 12, 'changeType': 'End of appliance run', 'key': 'applianceEnd', 'changes': [{'affectedUnit': 'appliance', 'Status': 'To Completed', 'Step': 'Same', 'changeCompleted': False}]}
As you can see, the Changes Manifest consists of many lines of simple JSON-like data that are easy enough for a human to read. Each line describes a specific step, with one or more smaller changes at each step, and with a changeCompleted field marked either True or False.
The example above is the first printout in a log file, so changeCompleted is False in every step at the very start of a run.
The logs will print a new copy of the Changes Manifest at every step, so that if you examine the logs, you will see that each new copy of the Changes Manifest marks the changes in one more step as changeCompleted:True until the very last step shows all steps and all changes as changeCompleted:True.
To diagnose a problem, you can therefore look through the logs for a human-readable changes summary marked by the [ acm ] at the start of each line and containing the distinctive structure shown above. The human-readable summary stands out because of its length and plain-English wording, and it will clearly tell you the status of the Agile Cloud Manager workflow at each point.
You can then further examine the Changes Manifest to identify which step at each point is the last step to be marked as completed.
The step after the last completed step would be the step where something broke, if anything broke.
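This lookup can also be scripted. The following Python sketch assumes the manifest lines look like the demo output above: an [ acm ] prefix followed by a Python-style dictionary literal, with each copy of the manifest introduced by a line containing "The current status of the". All function names are hypothetical, and the sample lines are shortened illustrations rather than real output.

```python
import ast

PREFIX = "[ acm ] "
HEADER_MARKER = "The current status of the"


def parse_manifest_line(line):
    """Parse one Changes Manifest line into a dict, or return None."""
    text = line.strip()
    if text.startswith(PREFIX):
        text = text[len(PREFIX):]
    if not text.startswith("{"):
        return None
    try:
        # The demo lines use single quotes and True/False, so they are
        # read here as Python literals rather than as strict JSON.
        entry = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return None
    if isinstance(entry, dict) and "changeIndex" in entry:
        return entry
    return None


def last_manifest_copy(log_lines):
    """Return the parsed entries of the last manifest copy in the log."""
    headers = [i for i, line in enumerate(log_lines) if HEADER_MARKER in line]
    if not headers:
        return []
    entries = []
    for line in log_lines[headers[-1] + 1:]:
        entry = parse_manifest_line(line)
        if entry is None:
            break
        entries.append(entry)
    return entries


def first_incomplete_step(entries):
    """Return the first step with any change not yet completed, or None."""
    for entry in sorted(entries, key=lambda item: item["changeIndex"]):
        if not all(change["changeCompleted"] for change in entry["changes"]):
            return entry
    return None


sample_log = [
    "[ acm ] The current status of the 2 changes being made in this run is:",
    "[ acm ] {'changeIndex': 1, 'changeType': 'Start of appliance run', "
    "'changes': [{'changeCompleted': True}]}",
    "[ acm ] {'changeIndex': 2, 'changeType': 'Start of a system', "
    "'changes': [{'changeCompleted': False}]}",
]
step = first_incomplete_step(last_manifest_copy(sample_log))
print(step["changeIndex"], step["changeType"])  # 2 Start of a system
```

If the function returns None, every step in the last manifest copy was marked as completed.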
Once you know the point in the workflow, you can scroll down from the last human-readable changes summary and examine the outputs of each command run against underlying tools to find the last command that ran, along with any error message that might have been printed.
Each underlying tool will have a unique pattern which will repeat in all its logs, so that you can learn to use the information to debug issues that arise in the performance of underlying tools.
For example, the command that the Agile Cloud Manager runs against a given tool will be printed in the logs along with any required information about the directory in which the command is being run.
Platform engineers developing with the Agile Cloud Manager can navigate their terminals to the directory given in the logs and can paste in the underlying third-party tool command that was run when the pipeline broke. This should give enough information to identify the root cause of the problem so that it can be fixed.
Platform engineers should be able to fix any underlying problems during development, so that those problems are resolved before each template is elevated to higher environments.
Some problems occur when third-party systems have outages elsewhere on the internet. Other problems occur when credentials expire, or for other reasons that have nothing to do with the template itself.
Operations engineers who use the Agile Cloud Manager’s logs in pipelines can identify whether a problem simply requires re-running the job that broke in a pipeline, or whether other changes might need to be made, such as potentially updating credentials, or potentially deploying to a different cloud region if a cloud provider is having a regional outage.