Practical Ideas for Ransomware Resilience in VMware vSphere Environments

Introduction

Ransomware is a serious threat, and with large-scale attacks making the national news there are a lot of organizations asking, “how do we protect ourselves?” This is a practical discussion about things you, as a vSphere administrator, may want to consider to make your environment resilient in the face of these attacks.

As you read these, you’ll note that some of these ideas are not technical controls. Instead, they are people & process changes. Ransomware attacks exploit any gap in an organization’s security. Defending against these attacks involves a mix of approaches: a return to information security basics, the inclusion of new technology and defense tactics, and organizational procedural changes. In many ways the process changes are the hardest, because they are cultural changes, and cultural change is the hardest type of change to make. The “if it ain’t broke don’t fix it” mentality is still pervasive in most organizations, and proactive change under that attitude is hard, at least until a very public breach occurs.

Most of the ideas in this paper have resiliency benefits beyond ransomware defense. A great example is patching. Everyone loves to hate patching, and indeed, it is a thankless chore. However, regular patching pays other dividends:

  • It encourages system resiliency, because patching is just a scheduled outage. If a system cannot survive patching it won’t survive other types of disruptions.
  • Patching encourages application administrators to have their applications start automatically, which is a huge help during non-ransomware outages. Not having to call all your application admins after a power outage is wonderful!
  • Regular patching encourages automation, which everyone loves to talk about but rarely actually implements, because they’re too busy “fighting fires.” Automation can help put those fires out, and free staff to work on things that actually move the organization forward.
  • Best of all, patching encourages everyone to talk to each other. Security is everyone’s responsibility, and when there’s good communication there are good outcomes, too. History can be informative, but avoid blame, and work to make things better regardless of how you got to where you are now.

Is this paper a complete discussion of ransomware? Absolutely not. Security is a process, and this paper is intended to start changing attitudes and approaches based on the tactics that ransomware attackers and cybercriminals demonstrate. Every environment is different, and every implementation has details that will change how an attacker is able to interact with it. Details matter in this business, and to fully double-check your approach you should engage a cybersecurity firm that specializes in penetration testing and incident response.

Disclaimer

This document is intended to provide general guidance for organizations that are considering VMware solutions to help them address compliance & security requirements. The information contained in this document is for educational and informational purposes only. This document is not intended to provide advice and is provided “AS IS.” VMware makes no claims, promises, or guarantees about the accuracy, completeness, or adequacy of the information contained herein. Organizations should engage appropriate legal, business, technical, and audit expertise within their specific organization for review of regulatory compliance, security requirements, and effectiveness of implementations.

Extend Your Business Continuity Plans

Ensure that attacks are covered in your organization’s disaster recovery & business continuity planning. Plan for an “everything down” scenario. Proactively engage a security consultancy that specializes in incident response. Carbon Black maintains a list of incident response partners as a starting point, linked from the

Make a plan to restore services and functionality without paying the attackers. Paying the attackers does not guarantee anything:

  • Remember that these are criminals, and they act only in their own best interest, whatever they promised you.
  • Organizations that pay often find that what they get, like the decryption program for their data, doesn’t work correctly or works too slowly to be helpful. Additionally, can you trust a decryption program from an enemy? Don’t run untrustworthy software without having a professional confirm that it isn’t malicious.
  • Studies have shown that organizations that pay get attacked again, soon after the first attack, by other cybercriminal gangs, as the first group of attackers shares vulnerability information with their criminal partners.
  • Last, paying the criminals funds their criminal activities, further exacerbating the problems. Additionally, there is discussion in federal, state, and local governments about making ransomware payments illegal.

As part of your organization’s business continuity plan your systems should be classified according to their business criticality. This will help you assign correct security controls to systems, assess dependencies, and prioritize work during an incident.
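
As a minimal sketch of how criticality and dependencies can drive prioritization during an incident, the Python below orders systems for restoration so that dependencies come first and, among systems that are ready, the most critical is restored first. The system names, tiers, and dependencies are illustrative assumptions, not a recommendation for any particular environment.

```python
import heapq
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical classification: name -> (criticality tier, systems it depends on).
# Tier 1 is the most critical. Every entry here is an assumption for illustration.
SYSTEMS = {
    "dns": (1, set()),
    "identity": (1, {"dns"}),
    "vcenter": (1, {"dns", "identity"}),
    "database": (2, {"dns", "identity"}),
    "payroll-app": (2, {"database"}),
    "intranet-wiki": (3, {"identity"}),
}


def restore_order(systems):
    """Return a restore order that respects dependencies.

    Among systems whose dependencies are already restored, the one with the
    lowest tier number (most critical) is chosen first; ties break by name.
    """
    sorter = TopologicalSorter({name: deps for name, (_, deps) in systems.items()})
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    ready = []
    order = []
    while sorter.is_active():
        for name in sorter.get_ready():
            heapq.heappush(ready, (systems[name][0], name))
        _, name = heapq.heappop(ready)
        order.append(name)
        sorter.done(name)
    return order


if __name__ == "__main__":
    for step, name in enumerate(restore_order(SYSTEMS), start=1):
        print(step, name)
```

A circular dependency, such as an identity provider hosted on the infrastructure that authenticates against it, surfaces here as an error, which is a useful finding to have before an incident rather than during one.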

Ensure that contact information and roles & responsibilities documents are stored in a place that will be accessible if IT systems are offline. Many organizations, with otherwise terrific business continuity plans, have found themselves hampered because their plans were stored on systems that were inaccessible because of the outage.

Build & Maintain an Accurate Asset Inventory

In a situation where details matter, and a single device with a vulnerability can open the door to a full-blown attack, it is very important to know what is on your network.

Does your organization have a 100% accurate inventory of devices? Does it include the operating system, patch level, network addresses, and installed software? Does it include infrastructure that may sometimes be invisible to network-level auditing? Does it include an audit against the network equipment’s perspective? A rogue device may not have an IP address, and devices often don’t respond to an ICMP echo or “ping” even if they aren’t malicious. It is often helpful to use a variety of techniques to inventory what is on a network. The ARP table on a network switch, the nmap scanning tool, and even a physical inventory of switch ports can help determine if there are devices omitted, or perhaps hiding, from an asset inventory.
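
One way to put several inventory sources side by side is to compare hardware (MAC) addresses, since a rogue device may not have an IP address. The sketch below is illustrative only: it assumes the recorded inventory and the addresses observed on switches have already been exported as lists of strings, and the function name and sample addresses are hypothetical.

```python
def normalize_mac(mac):
    """Normalize a MAC address to lowercase, colon-separated form."""
    return mac.strip().lower().replace("-", ":")


def reconcile(inventory_macs, observed_macs):
    """Compare MAC addresses recorded in the inventory with those seen on the network.

    Returns devices seen on the network but absent from the inventory, and
    devices in the inventory that were not observed.
    """
    recorded = {normalize_mac(m) for m in inventory_macs}
    observed = {normalize_mac(m) for m in observed_macs}
    return {
        "unknown_on_network": sorted(observed - recorded),
        "missing_from_network": sorted(recorded - observed),
    }


if __name__ == "__main__":
    # Sample data is made up for illustration.
    inventory = ["00:50:56:AA:BB:01", "00:50:56:AA:BB:02"]
    seen_on_switch = ["00-50-56-aa-bb-01", "00-50-56-aa-bb-99"]
    print(reconcile(inventory, seen_on_switch))
```

Entries in "unknown_on_network" deserve investigation. Entries in "missing_from_network" may simply be powered off, or may indicate a stale inventory.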

On vSphere, administrators can use PowerCLI to easily generate lists of virtual machines and their properties, as well as ESXi hosts and their properties, too. Carbon Black Workload, which is a terrific tool for assessing vulnerabilities in virtualized workloads, can also be used to generate asset maps and reports.

Some organizations have full-blown Configuration Management Databases (CMDBs), but if you don’t, consider starting out with a simple spreadsheet. The Center for Internet Security (CIS) has a sample asset management spreadsheet as part of their 18 core security controls. It is also worth noting that CIS agrees on how important asset inventories are: their first and second priorities on that list are both about asset management.

Improve Patching

People roll their eyes when they see an item like this, but patching is the only way to remove a vulnerability from an environment. It’s also hard to argue against the idea that patching processes are fundamentally broken in many organizations. Long change control processes for approving security patches are an attacker’s best friend. While a Change Advisory Board is taking 90 days to approve a change, the attacker is using the vulnerability to turn a small malware infection on a desktop into a full-blown attack on your organization’s infrastructure.

Use the asset inventory and the prioritization from the business continuity planning process to determine the level of security needed for systems. Business critical systems, such as identity management systems, should be designed so they can be updated in minutes and hours from the release of a patch. Remember that if a system cannot be patched during business hours it won’t survive an actual outage, either. Patching is simply a small, scheduled test of the resiliency of individual services and systems.

Organizations that use ITIL methodologies to classify changes should consider working to make all security patches either a Standard change or an Emergency change and discontinue using Normal changes.

Standard changes are well-documented, repeatable changes that can be applied at any time, and this approach works well for most security updates rated lower than “Important” and “Critical.” Indeed, patching should be a process that is repeatable, reliable, and minimally disruptive, through staged rollouts and good system design for availability.

Emergency changes should be used for security vulnerabilities considered “Important” or “Critical.” Often the difference between those two levels is whether the attacker needs to be authenticated or not. In environments that use external authentication sources, such as Active Directory, an attacker may already be able to authenticate. This turns an Important vulnerability into a Critical one.
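
The classification logic described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not an ITIL or VMware definition: the severity labels "Low" and "Moderate", the function names, and the parameters are all hypothetical.

```python
def effective_severity(severity, requires_authentication, external_auth_in_use):
    """Treat an Important vulnerability as Critical when the attacker may already
    be able to authenticate through an external identity source."""
    if severity == "Important" and requires_authentication and external_auth_in_use:
        return "Critical"
    return severity


def change_type(severity, requires_authentication=False, external_auth_in_use=False):
    """Map a security patch to an ITIL change type, never to a Normal change."""
    adjusted = effective_severity(severity, requires_authentication, external_auth_in_use)
    if adjusted in ("Important", "Critical"):
        return "Emergency"
    return "Standard"


if __name__ == "__main__":
    print(change_type("Moderate"))
    print(change_type("Important", requires_authentication=True, external_auth_in_use=True))
```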

Acting as if an attacker is already inside the corporate perimeter is an approach many organizations have started using in order to improve their own security design. Assuming the attacker is already close by changes how organizations respond to new security threats and disclosures.

VMware vSphere has many features that help turn patching into a low-effort, automated task that can be done during normal business hours. vMotion moves workloads between hosts in a way that most workloads never notice, and vSphere 7 improved it so that even the most sensitive workloads move gracefully now. Distributed Resource Scheduler, or DRS, works with vMotion to help automatically coordinate relocating workloads during patching. vSphere Lifecycle Manager can manage and remediate hosts from the firmware all the way up, making patching something you can start in the morning and walk away from.

Patching vCenter Server is easy, through the Virtual Appliance Management Interface (VAMI). Patching vCenter Server does not bring workloads down but does impact the management interface that vSphere Admins use to interact with the environment. Systems that connect to vSphere, such as backup and monitoring systems, should be set to automatically reconnect and resume operations when vCenter Server becomes available again.

Separate Infrastructure Authentication

From an authentication & authorization perspective there are three big realities that work against us when building and operating infrastructure systems:

  1. Phishing attacks and credential reuse are a common method for attackers to gain access to an organization. Attackers gain control of a user account and then use that to conduct attacks from the inside. vSphere Admins are not immune to this, and the compromise of a vSphere Administrator’s account can be devastating. The attacker can disable other security controls, exfiltrate data, and encrypt workloads at will.
  2. Centralized identity providers like Microsoft Active Directory are a major target for attackers. This is not a criticism of Active Directory; this issue exists for any centralized identity provider. In Active Directory’s case, it is a target precisely because it is such a successful way to manage enterprises. Attackers know that if they can break into Active Directory they can often achieve their own nefarious goals much faster.
  3. Infrastructure systems often host the identity providers that they rely on for authentication and authorization, and in the case of Active Directory, often DNS, DHCP, and Certificate Management, too. Controlling these dependencies is crucial for the “everything down” scenario that should be in an organization’s business continuity plans.

vSphere infrastructure should be managed with dedicated administration accounts that are not used for general-purpose computing and administrative tasks by the administration team. These accounts should be isolated from the greater organization’s identity management systems so that a breach of those systems does not immediately compromise the infrastructure itself. This is often done with highly restricted, highly monitored, purpose-built identity management instances inside the secure management perimeter, not tied to the organizational identity management system in any way.

Done this way, an attacker will cause “failed login” log messages as they attempt to reuse stolen credentials. Authentication systems should monitor for both failed and successful logins of administration accounts, and those logins should be audited at regular and timely intervals to detect an attacker who is successful in using stolen credentials. For a separate identity infrastructure, it is often possible to alarm on all failed login attempts.
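
A minimal sketch of that monitoring idea follows, assuming login events have already been parsed into dictionaries. The account names and the event keys ("account", "outcome", "source") are hypothetical and would need to match whatever your authentication system actually logs.

```python
# Hypothetical set of dedicated infrastructure administration accounts.
ADMIN_ACCOUNTS = {"vsphere-admin-01", "vsphere-admin-02"}


def triage_logins(events, admin_accounts=ADMIN_ACCOUNTS):
    """Split administration-account logins into alarms and an audit list.

    Every failed login raises an alarm; every successful login is kept so it
    can be reviewed at regular and timely intervals.
    """
    alarms, audit = [], []
    for event in events:
        if event["account"] not in admin_accounts:
            continue  # only administration accounts are in scope here
        if event["outcome"] == "failure":
            alarms.append(event)
        elif event["outcome"] == "success":
            audit.append(event)
    return alarms, audit


if __name__ == "__main__":
    # Sample events are made up for illustration.
    sample = [
        {"account": "vsphere-admin-01", "outcome": "failure", "source": "198.51.100.20"},
        {"account": "vsphere-admin-01", "outcome": "success", "source": "198.51.100.20"},
        {"account": "jsmith", "outcome": "failure", "source": "203.0.113.7"},
    ]
    alarms, audit = triage_logins(sample)
    print("alarms:", alarms)
    print("to audit:", audit)
```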

Attackers who compromise an identity source can often add themselves to authorization groups, and simply log into systems they should not otherwise have access to. Additionally, reliance on central identity systems means that the administrators of those systems are potentially infrastructure administrators, too, as they can add themselves to infrastructure access groups at will. Some regulatory compliance efforts, such as for PCI DSS and NIST 800-171, flag those identity management admins as “in scope” for audits and compliance actions. Organizations that do not wish their domain admins to be storage, firewall, vSphere, or other admins should reconsider the use of domain groups for authorization.

Most infrastructure, including vSphere, allows authorization to be done on the devices or systems themselves. If that is infeasible due to scale then other techniques for separation, and/or automation of account management, should be employed. Recent attacks that made headlines, though, remind us to protect automation systems as well.

Restrict Access to All Management Interfaces

It has become known recently that some vSphere installations are fully exposed to the Internet. While this makes it very easy to manage those environments, it also means the entire Earth can help manage them, too. This is an extreme example of what not to do when securing your vSphere environment. Practice the information security concepts of defense-in-depth and least privilege and restrict infrastructure management interfaces – vSphere and everything else – to very specific people and places.

We recommend going much further than that, though. Organizations that assume that an attacker is inside their corporate perimeter also add internal controls to ensure that only authorized devices can access the management interfaces and systems.

VMware strongly recommends that ESXi hypervisor management interfaces be on a separate VLAN or network segment, subject to independent perimeter controls and monitoring. Likewise for hardware management controllers that are often found on enterprise servers. These management controllers should be isolated on their own VLAN, away from everything else, and isolated as much as possible from ESXi. This often means disabling emulated USB NICs and other interconnections that could serve as back doors for attackers.
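
To check that such restrictions hold in practice, connection records from a firewall or flow log can be compared against the networks that are supposed to reach management interfaces. The sketch below is illustrative only: the address ranges are documentation ranges standing in for real ones, and the record format (source, destination) is an assumption.

```python
import ipaddress

# Hypothetical ranges: replace with your own management and admin workstation networks.
MANAGEMENT_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]
ADMIN_WORKSTATION_NETWORKS = [ipaddress.ip_network("198.51.100.16/28")]


def in_any(address, networks):
    """Return True if the address falls inside any of the given networks."""
    return any(address in network for network in networks)


def unexpected_management_flows(flows):
    """Return flows that reach a management network from a non-admin source.

    Each flow is a (source_ip, destination_ip) pair of strings.
    """
    flagged = []
    for source, destination in flows:
        src = ipaddress.ip_address(source)
        dst = ipaddress.ip_address(destination)
        if in_any(dst, MANAGEMENT_NETWORKS) and not in_any(src, ADMIN_WORKSTATION_NETWORKS):
            flagged.append((source, destination))
    return flagged


if __name__ == "__main__":
    sample_flows = [
        ("198.51.100.20", "192.0.2.10"),  # admin workstation to management: expected
        ("203.0.113.7", "192.0.2.10"),    # other source to management: flagged
        ("203.0.113.7", "203.0.113.99"),  # not a management destination: ignored
    ]
    print(unexpected_management_flows(sample_flows))
```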

vCenter Server is for vSphere Administrators

vCenter Server and other management infrastructure, such as vRealize Operations Manager, should also be isolated on its own network segment. Access to these network segments should be restricted to only staff with vSphere administration duties.

“What about workload admins?” is a common question. Indeed, virtual infrastructure does not exist to serve itself. It is there to enable workloads. However, every person you authorize into vCenter Server is a path to a potential compromise.

Instead, limit access to vCenter Server to only the vSphere Administrators themselves. Administrator access to the workloads should be through the workload virtual machine’s own network interface, via SSH or RDP. That makes their management traffic and access subject to network intrusion detection and other monitoring systems. It also simplifies the access control for both the workload and vSphere by avoiding the need to co-mingle access requirements. This makes auditing access, monitoring access & network traffic, and ongoing management much easier.

Nobody should have access to the management interfaces of ESXi, except vSphere Administrators, from their own specific workstations. Even then, all vSphere management activities should be conducted through vCenter Server where they are subject to the Role-Based Access Control (RBAC) permission model. ESXi should have SSH disabled and normal lockdown mode enabled, in accordance with the vSphere Security Configuration Guide’s best practices. SSH is a troubleshooting interface, and is shipped in the disabled & stopped state because it is not a supported day-to-day management interface. If SSH access is needed it should be done tactically, enabled for the troubleshooting operation, and then returned to its disabled & stopped state.
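
A periodic check of those two settings could look like the sketch below. It assumes host settings have already been exported (for example with PowerCLI) into simple records; the field names and the lockdown values "disabled", "normal", and "strict" are hypothetical labels for illustration and are not the vSphere API's own identifiers. Treating strict lockdown as acceptable alongside normal is also an assumption.

```python
def audit_hosts(hosts):
    """Return a mapping of host name to a list of hardening findings.

    Each host is a dict with assumed keys: "name", "ssh_running", "lockdown_mode".
    Hosts with no findings are omitted from the result.
    """
    findings = {}
    for host in hosts:
        problems = []
        if host["ssh_running"]:
            problems.append("SSH service is running")
        if host["lockdown_mode"] not in ("normal", "strict"):
            problems.append("lockdown mode is not enabled")
        if problems:
            findings[host["name"]] = problems
    return findings


if __name__ == "__main__":
    # Sample records are made up for illustration.
    exported = [
        {"name": "esxi-01", "ssh_running": False, "lockdown_mode": "normal"},
        {"name": "esxi-02", "ssh_running": True, "lockdown_mode": "disabled"},
    ]
    for name, problems in audit_hosts(exported).items():
        print(name, "->", "; ".join(problems))
```

A host that shows SSH running outside of a known troubleshooting window is worth a closer look, since the service should have been returned to its disabled & stopped state.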

The HTML5 vSphere Client proxies virtual machine console connections through vCenter Server if others need access, and that access can be controlled using the RBAC permission model in vCenter Server.

Authentication into vCenter Server can be through vSphere Identity Federation, a feature that enables vCenter Server to use Microsoft Active Directory Federation Services (ADFS) as an authentication source. Nearly all multifactor authentication solutions plug into ADFS, making it a great way to deter credential attacks against vSphere.

Many organizations adopt a mindset for vCenter Server access that is taken directly from traditional data center practices. Most organizations do not allow everyone in the organization to stroll into the data center whenever they desire. Only staff that have a business requirement to be in the data center can enter. The same approach, adapted to vCenter Server, works well for determining who needs access. Practically speaking, workloads do not need to be managed from the console of the virtual machine on a day-to-day basis, and those that do can connect to the console of Microsoft Windows with the “/admin” switch (called “/console” in older versions) for the Remote Desktop client.

If other tools, such as vRealize Operations Manager, have rights to modify virtual environments they should be considered an administration method and protected accordingly. For example, vRealize Operations Manager can have connectors configured to allow the automatic rightsizing of a virtual machine. This is a powerful and easy-to-use management feature, but access to it should be considered carefully. If your organization doesn’t continuously right-size workloads it may be better to leave the connectors configured as read-only.

Ensure Backup & Restore Functionality Works and is Secure

Earlier we mentioned that paying a ransom often does not result in a good outcome for the victim of the attack. One reason is that ransomware decrypters may not work, may work extremely slowly, or may contain additional backdoors and malware. We also mentioned that paying a ransom rewards the attackers, which is something we should not do. A way to avoid paying a ransom, and to ensure recoverability, is to have a reliable and secure backup solution.

Backups are likely already part of your IT environment. It is worth considering how they are configured in the context of ransomware, though. Backups are a last line of defense, and it is important that an attacker not be able to delete or otherwise corrupt them. Backup vendors often have best practices for implementing their own software, but we will discuss some ideas and considerations here.

The time between a breach and a full-blown ransomware attack can be quite long. A recent study shows that the average time between a breach and detection of the breach is over 200 days. Attackers are patient and will work to ensure that you must pay the ransom. This means that disabling and/or corrupting your backups is a high priority to them. Organizations need to make it extremely difficult or impossible for attackers to do so.

Over a period that long it is likely your backup systems will be capturing the results of the attack on affected systems. This is an important consideration, because restoring a backup may also restore infected systems, and/or restore systems to a vulnerable state (yet another reason for a timely security patching process). It is worth considering how you would recover systems in a questionable state, as well as how you would assess the reliability of such systems.
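
As a small illustration of that consideration, the sketch below picks the most recent backup taken before an estimated breach time. The function name and timestamps are hypothetical, and the estimate itself would come from an incident response investigation. A backup that predates the breach may still be unpatched, so a restored system needs to be isolated and updated before it returns to service.

```python
from datetime import datetime


def last_backup_before(backup_times, estimated_breach):
    """Return the newest backup timestamp strictly before the estimated breach.

    Returns None when every available backup was taken at or after the breach.
    """
    candidates = [taken for taken in backup_times if taken < estimated_breach]
    return max(candidates, default=None)


if __name__ == "__main__":
    # Sample timestamps are made up for illustration.
    backups = [
        datetime(2021, 1, 3, 2, 0),
        datetime(2021, 2, 7, 2, 0),
        datetime(2021, 3, 7, 2, 0),
    ]
    breach = datetime(2021, 2, 20, 14, 30)
    print(last_backup_before(backups, breach))
```

If the function returns None, retention did not reach back far enough to cover the attacker's dwell time, which is worth knowing when setting retention periods.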

It may also be worth considering changes to how systems are configured. For instance, even if the operating system and application files are corrupted by malware it may be possible to remount or reattach the data to a fresh installation of an application. Database servers are a good example of this, where the data files and the application are held separately. Some organizations practice this ahead of time so that if they are attacked they know how they will respond, and won’t need to learn new things when they are already under stress.

Keep your backup systems separate from corporate identity systems, as well as from the separate infrastructure identity systems. Some of VMware’s backup vendor partners strongly recommend never joining your backup systems to an Active Directory domain. Don’t consider joining your backup systems to a separate infrastructure authentication solution, either. If the infrastructure is breached you don’t want attackers to gain access to the backups, because that would cost you your last line of defense and your chances of recovery. Backup systems should be an island, extremely isolated, managed by few staff, monitored heavily, but also protected against outside influence from centralized monitoring and management systems.

For these same reasons we do not suggest installing backup management plugins in the vSphere Client. Manage the backup system through its own interfaces.

Use immutable backups and/or “air gaps” where possible. Immutable backups are backups that cannot be altered, and in the past this has typically been implemented with specialized backup media like “WORM” – Write Once, Read Many – devices. Most current backup solutions also have more modern, software-based approaches to this. Air gaps are where backups are intentionally taken offline, and often offsite as well, to protect against threats like fire, flood, burglary, and such. There are several strategies for this, each dependent on the software and the implementation.

Please note that there may be legal implications to changes in backup strategies, especially considering GDPR and data retention issues. Many larger organizations have specific data retention policies, not just to ensure that data is retained for certain periods of time, but also that it is destroyed when that time is over.

Conclusion

Ransomware attacks are perpetrated by criminals looking to profit from gaps in an organization’s security posture. By changing our assumptions about security, returning to the fundamentals of least privilege and defense-in-depth, and adopting zero trust design principles, we can have more effective defenses against these attackers.

About the Author

Bob Plankers has been bridging people, process, and technology in information security and IT infrastructure operations for nearly three decades. Prior to joining VMware he spent 24 years in enterprise IT, wearing many hats (both figuratively and literally) from support to developer to systems engineer to manager, and enjoys bringing his experience to bear on modern issues in IT. Have a suggestion, a comment, or see an error in this document? Please email him at rplankers@vmware.com.

 
