6 Jun 2010

Working with Domain Member Virtual Machines and Snapshots

One of the benefits of a virtualization product that supports
snapshots is the ability to capture a "point in time" to which you can
always revert your virtual machines. Reverting to a snapshot returns
the VM to the state in which it was saved, so you can test software,
do QA, build labs and so on.

However, one of the nasty issues of working with snapshots shows up
when one or more of your virtual machines are members of an Active
Directory domain. When you create snapshots of such machines and
restore them, you might occasionally find that all authentication
involving the VM seems to fail. You may be unable to log on to the
virtual machines, or unable to access files and shares across the
network. You might even get errors like this one:

Windows cannot connect to the domain, either because the domain
controller is down or otherwise unavailable, or because your computer
account was not found. Please try again later. If this message
continues to appear, contact your system administrator for assistance.

If you log on locally (not using a domain account) to the computer
(in this example it's a Windows XP Pro client), you'll see the
following events in the Event Viewer.

NETLOGON 3210

This computer could not authenticate with
\\WIN2003-SRV1.petrilabs.local, a Windows domain controller for domain
PETRILABS, and therefore this computer might deny logon requests. This
inability to authenticate might be caused by another computer on the
same network using the same name or the password for this computer
account is not recognized. If this message appears again, contact your
system administrator.

LSASRV 40961

The Security System could not establish a secured connection with
the server cifs/WIN2003-SRV1.petrilabs.local. No authentication
protocol was available.

W32Time 18

The time provider NtpClient failed to establish a trust
relationship between this computer and the petrilabs.local domain in
order to securely synchronize time. NtpClient will try again in 15
minutes. The error was: The trust relationship between this
workstation and the primary domain failed. (0x800706FD)

And possibly others.

This is nasty. However, if you remember the days when Ghost was the
only way to image a computer, you might also remember that it was
always good practice not to "ghost" a machine that was a member of a
domain. If you ignored that advice, you ended up with a computer
restored from an image that, in some cases, could not log on to the
domain it was a member of. So this is not a new situation; only the
"ghosting" tools have changed.

The reason for this is a computer account password mismatch. The
Windows-based domain member VM thinks that its machine account
password is X, while the domain controller believes it to be Y.
Because of this, the VM cannot authenticate itself to the domain
controller(s).

So how does this work? Just like a user account password, the computer
account password is a "secret" that belongs to the computer account.
It is used when a Windows-based domain member computer authenticates
itself to the domain controller and establishes a secure channel.

When the computer is started, a service called NetLogon uses the
machine account password and tries to establish a secure session with
the domain controller. The usual CTRL+ALT+DEL Winlogon process also
relies on this authenticated secure channel to send user credentials
to the domain controller for verification and log them into the
computer. Other services running on this machine that work with the
LocalSystem or NetworkService credentials also require this
authenticated secure channel to get access to domain resources.

So without the correct password there can be no secure channel, and
everything that depends on it fails, hence the issues described above.

The password is first created when the computer is joined to a domain.
It is shared by the domain controller and the computer.

So what happens during regular operations? Well, to explain this, we
need to think of three scenarios:

1. Regular operation: the client computer works "regularly" and is
never offline for extended periods of time. Each Windows-based
computer maintains a machine account password history containing the
current and previous passwords used for the account. Normally, the
computer account password change is initiated by the Netlogon service
on the client computer every 30 days by default. All versions of
Windows since Windows 2000 use the same default. After this change,
both the domain controller and the computer use the new password for
authentication.

When a client determines that the machine account password needs to be
changed, it tries to contact a domain controller for the domain of
which it is a member and change the password there. Only if this
operation succeeds does it update the machine account password
locally.

When two computers attempt to authenticate with each other and a
change to the current password has not yet been received, Windows
relies on the previous password. If the sequence of password changes
exceeds two changes, the computers involved may be unable to
communicate, and you may receive error messages.

2. Not-so-regular operation, but still possible: a client computer is
taken offline for an extended period of time (30, 60, 90 days or
more). In this scenario nothing expires, even if the computer is
turned off for three months. When the computer starts up, it notices
that its password is older than 30 days, and the Netlogon service on
the client computer initiates a change. This only comes into play when
the machine has been turned off for that long.

3. Snapshots: either a "Ghost"-type image or (the subject of this
article) a VM snapshot is taken, and the computer then resumes regular
operation (as in scenario #1). Then suddenly, after working for 30,
60, 90 days or more, the snapshot is brought back. When the domain
member is restored to an older snapshot, it loses track of any
password changes made later and tries to use an older password. Hence
it fails to authenticate itself.
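
To make the third scenario concrete, here is a toy Python model of the
password bookkeeping described above. It is only a sketch and not how
Netlogon is implemented: the class and function names are made up for
illustration, and it assumes exactly one previous password is still
honored, so a single missed change is tolerated. The real tolerance in
Windows may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DomainController:
    """Hypothetical DC record: current and previous machine password."""
    current: str
    previous: Optional[str] = None

    def change_password(self, new_password: str) -> None:
        # The old current password becomes the single history entry.
        self.previous = self.current
        self.current = new_password

    def authenticate(self, presented: str) -> bool:
        # Assumption: the current password or the one before it is accepted.
        return presented in (self.current, self.previous)


@dataclass
class DomainMember:
    """Hypothetical domain member holding its copy of the password."""
    password: str
    changes: int = 0

    def rotate(self, dc: DomainController) -> None:
        # The change is made on the DC first, then stored locally.
        self.changes += 1
        new_password = f"secret-{self.changes}"
        dc.change_password(new_password)
        self.password = new_password

    def snapshot(self) -> "DomainMember":
        # A snapshot is simply a copy of the member's state.
        return DomainMember(self.password, self.changes)


def secure_channel_ok_after_revert(rotations_since_snapshot: int) -> bool:
    """Take a snapshot, rotate N times, revert, then try to authenticate."""
    dc = DomainController(current="secret-0")
    member = DomainMember(password="secret-0")
    saved = member.snapshot()
    for _ in range(rotations_since_snapshot):
        member.rotate(dc)
    reverted = saved.snapshot()  # the VM is rolled back to the snapshot
    return dc.authenticate(reverted.password)
```

In this model, reverting after zero or one password change still
authenticates, while reverting after two or more changes fails, which
is the situation the error messages above describe.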

So how do you fix this? Well, first of all, if you've already gotten
to the point where the error occurred and you cannot log in, you will
need to read my Fixing "Windows cannot connect to the domain, either
because the domain controller is down or otherwise unavailable, or
because your computer account was not found" Errors article for a
solution.

However, if you wish to prevent this from happening AND you're using
virtualization software and snapshots, you may want to do one of the
following:

Option #1

Increase the computer account password age, or disable password
changes altogether. Both of these can reduce the likelihood of the
problem, but may reduce the level of security in the domain. On the
other hand, since this is probably a test, QA or demo environment, you
may consider it a valid option. These settings are available on the
domain member (and not on the domain controller), and as such, you can
change them on your computer before you create a snapshot of it.

Warning!

This document contains instructions for editing the registry. If you
make any error while editing the registry, you can potentially cause
Windows to fail or be unable to boot, requiring you to reinstall
Windows. Edit the registry at your own risk. Always back up the
registry before making any changes. If you do not feel comfortable
editing the registry, do not attempt these instructions. Instead, seek
the help of a trained computer specialist.

As noted above, these settings are configured on the domain member,
and are controlled by the Netlogon service. Settings are found in the
following Registry key:

HKLM\SYSTEM\CurrentControlSet\Services\NetLogon\Parameters

DisablePasswordChange (default off) prevents the client computer from
changing its computer account password. To disable password changes,
give it a value of 1.

MaximumPasswordAge (default 30 days) determines when the computer
password needs to be changed. Change it to whatever number of days you
think may be enough. For example, if you use snapshots that are less
than 100 days old, then you can set this value to 100 or similar.
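
If you need to apply these values to several lab machines, one option
is to generate a .reg file and import it on each domain member. The
Python sketch below builds the text of such a file. The function name
is made up for illustration, and it assumes both values are stored as
REG_DWORD, with MaximumPasswordAge expressed in days. Check this
against your Windows version before importing anything.

```python
NETLOGON_KEY = (
    r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet"
    r"\Services\NetLogon\Parameters"
)


def build_reg_file(max_password_age_days=None, disable_password_change=None):
    """Return .reg file text for the requested Netlogon settings.

    Hypothetical helper: only the settings passed in are written.
    Assumes both values are REG_DWORD.
    """
    lines = ["Windows Registry Editor Version 5.00", "", f"[{NETLOGON_KEY}]"]
    if max_password_age_days is not None:
        if max_password_age_days < 1:
            raise ValueError("MaximumPasswordAge must be at least 1 day")
        # .reg files write DWORD values as eight hexadecimal digits.
        lines.append(f'"MaximumPasswordAge"=dword:{max_password_age_days:08x}')
    if disable_password_change is not None:
        flag = int(bool(disable_password_change))
        lines.append(f'"DisablePasswordChange"=dword:{flag:08x}')
    return "\r\n".join(lines) + "\r\n"
```

For example, build_reg_file(max_password_age_days=100) produces a file
that sets the password age to 100 days and leaves
DisablePasswordChange untouched.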

Settings can also be configured by using Group Policy (either
domain-based GPO or local):

Computer Configuration\Windows Settings\Security Settings\Local
Policies\Security Options

Domain member: Disable machine account password changes

Domain member: Maximum machine account password age

After making the changes, reboot the client computer(s), and then
create a snapshot, if you need one.

Option #2

Live with it, know it's an issue, fix it every time. It's
time-consuming, sure, but it's probably more secure than option #1.
Read my Fixing "Windows cannot connect to the domain" Errors article.

Option #3

If these VMs are used for testing, QA, demos etc., you could consider
creating a "closed" environment, where not only the client computer
has a snapshot, but the domain controller(s) have one as well. When
you revert to a snapshot, you also revert to the same snapshot level
on the DCs, all of them, at the same time. For some environments this
may actually be a nice setup. However, if you cannot create such a
setup, then you'll probably have to go with either option #1 or #2.
