A story of a project: 3600 users to G Suite in 60 days! – Day 4: Starting up migration, the NDRs mishap!
This day fell on a weekend. Although I have made it clear to everyone around me that I never work on weekends, this was no ordinary time to apply that rule, so we did work on this day!
The main tasks we needed to accomplish this day were to prepare the migration computers (VMs on GCP) and to sort out the user groups to be migrated (based on domains).
I started the work by creating the virtual machines on Google Cloud Platform and installing CloudMigrator on each one. Having already requested the licenses, I made sure they were easily available on each migration machine. I was trying to keep things simple and make anything related to the project easy to reach from every machine, because I was expecting a big mess later on when we started the migrations (and it did indeed come!). I ended up with 11 VMs.
Having previously agreed on the migration plan, I started grouping the users whose mailboxes were larger than 20 GB and distributed them across the machines, then I started the migration for one group as a test. I was supposed to get all the other machines running after it, but something happened that put the plan on hold.
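To make the grouping step concrete, here is a minimal sketch of one way to spread the large mailboxes across the migration machines. It is not the exact method I used on the project: the function name, the input format (a dict of email address to mailbox size in GB), and the "largest first onto the least loaded machine" rule are all assumptions for illustration.

```python
import heapq


def distribute_mailboxes(mailboxes_gb, machine_count, min_size_gb=20):
    """Assign mailboxes larger than min_size_gb to machines, balancing total GB.

    mailboxes_gb: hypothetical input, a dict of user -> mailbox size in GB.
    Returns a dict of machine index -> list of users.
    """
    # Keep only the large mailboxes (strictly above the threshold).
    large = {user: size for user, size in mailboxes_gb.items() if size > min_size_gb}

    # Min-heap of (total GB assigned so far, machine index).
    heap = [(0.0, index) for index in range(machine_count)]
    heapq.heapify(heap)
    assignment = {index: [] for index in range(machine_count)}

    # Place the biggest mailboxes first, always on the least loaded machine.
    for user, size in sorted(large.items(), key=lambda item: item[1], reverse=True):
        total, index = heapq.heappop(heap)
        assignment[index].append(user)
        heapq.heappush(heap, (total + size, index))

    return assignment
```

With 11 VMs you would call it as `distribute_mailboxes(sizes, 11)` and feed each resulting list to one machine. A real plan would probably also account for the domain-based grouping mentioned above, which this sketch ignores.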
As it was a weekend, after I started the first test VM I went out with my wife and kids to spend some time together… Suddenly I received a call from one of the customer’s IT guys: there was an NDR issue going on with their VP, and we needed to fix it ASAP… Being in the middle of the mall, with no laptop and not even a charged phone, I could not do anything with what I had, so I rushed back home and sat down at the PC…
It turned out that one VP was sending an email and getting an NDR (non-delivery report) back… We checked the settings on Office 365 and the destination mailbox was there, and we checked Google and found the mailbox was there as well… But this is what we also found:
- There was a new sub-domain that we had created earlier for dual delivery purposes, and that domain was already active and had no problems.
- The user accounts, however, had NO ALIAS email address under that sub-domain…
- This was despite the fact that we had configured the sub-domain to be automatically assigned to each user account in the control panel!
- After checking the GCDS logs, we found out that we had not created an exception to exclude addresses under that sub-domain from the sync process, so the alias was removed from all users because it was not in their proxyAddresses AD attribute!
- As a result, no user had an alias that could be used for dual delivery and coexistence, while all users’ email messages were being actively forwarded from Office 365 to this non-existent alias. All users were receiving, or were about to receive, NDRs when sending emails to each other!
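The chain of events in the list above is easier to see with a toy model. The sketch below is only a simplified simulation of a one-way directory sync, not how GCDS is actually implemented: the function name, the regular-expression exclusion patterns, and the `example.com` / `dual.example.com` domains are all hypothetical.

```python
import re


def aliases_after_sync(google_aliases, proxy_addresses, exclusion_patterns=()):
    """Return the aliases a simplified one-way sync would leave on an account.

    An alias survives only if it also exists in the directory source
    (proxy_addresses) or matches one of the exclusion patterns.
    """
    # AD proxyAddresses values usually carry a prefix such as "SMTP:"; strip it.
    source = {address.split(":", 1)[-1].lower() for address in proxy_addresses}

    kept = []
    for alias in google_aliases:
        excluded = any(
            re.search(pattern, alias, re.IGNORECASE) for pattern in exclusion_patterns
        )
        if alias.lower() in source or excluded:
            kept.append(alias)
    return kept
```

Without a pattern covering the dual delivery sub-domain, the alias under that sub-domain is dropped because AD knows nothing about it. Passing something like `[r"@dual\.example\.com$"]` as `exclusion_patterns` keeps it, which is the role the missing exception was supposed to play.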
The impact of this was large (some users had already sent messages to groups), but we knew that no emails were lost: they had already been delivered on Office 365, and at that stage no one had moved to Google yet, so there was no email drop or service interruption. We just had to make it clear to users that this was a misconfiguration and a project issue, which had no effect on their data integrity.
Once we found the issue, fixing it was simple, but slow…
We first had to add the missing exception in GCDS, then we came to the problem of restoring the deleted alias for all users… Because we had 2000+ users in the control panel at that time, waiting for the automatic reassignment to happen again was not an option for us, so I needed to do things manually again (yay…)
First, I deleted the alias domain and re-added it as a separate secondary domain under G Suite (it had been added as an alias domain before). Then, using the magic of GAM, I made a little batch file that would assign the lost alias to each user again…
Because the affected users included some CxOs, VPs, and other whiny users, I made sure those users were at the top of the script file so they would get their aliases back first…
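For readers who want to try something similar, here is a rough sketch of how such a batch file could be generated with the priority users on top. It is not my original script. It assumes each alias reuses the local part of the user's primary address, and it assumes the GAM syntax `gam create alias <alias> user <user>`, which you should check against the GAM version you have installed. The function name and example addresses are hypothetical.

```python
def build_alias_commands(users, alias_domain, priority_users=()):
    """Build one GAM command line per user, priority users first.

    users: primary email addresses of all affected accounts.
    alias_domain: the sub-domain used for dual delivery (hypothetical value).
    priority_users: addresses that must be processed before everyone else.
    """
    user_set = set(users)

    # Priority users first, in the order given, skipping unknown or repeated ones.
    ordered = list(dict.fromkeys(u for u in priority_users if u in user_set))
    already_listed = set(ordered)
    ordered += [u for u in users if u not in already_listed]

    commands = []
    for user in ordered:
        local_part = user.split("@", 1)[0]
        # Assumed GAM syntax; verify it before running against a live domain.
        commands.append(f"gam create alias {local_part}@{alias_domain} user {user}")
    return commands
```

Writing the returned lines to a file, one per line, gives a script that restores the VPs' aliases before it works through the remaining accounts.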
Having left my wife and kids out, I quickly went to pick them up and returned home… The script was still running, but the emergency was over, as night was falling and mail flow had started to drop…
It took a while, but the script finally completed, and I did a quick check to make sure no one was missing their alias address… Then I went back to the migration machine to check on the progress.
As I expected, the migration was in much better shape and had gone smoothly, but it was too late and I was tired, so I moved the remaining tasks to the next day, which was also the second day of the weekend… But I knew what I needed to do and what to focus on, so I was not worried about losing time by working on a weekend.
This NDR issue was the first of multiple issues we faced as a consequence of leaving out the GCDS documentation…
My notes and lessons learned
Documentation is critical. I know this, and I do document all my work every day. But sometimes things just happen, no matter how careful and organized you are! Had I not skipped the documentation for the previous work on GCDS and the G Suite Admin Console, I would have noticed that something related to the alias domain was missing and needed to be configured in GCDS… But I did not!
I also hate to work, or to make others work with me, on weekends or after work hours. But once you commit to something, you have to feel responsible for it… I was working on a weekend not because I was asked to by my boss or the customer, and not because of the critical situation we had with the NDRs… I did it because I was the one guiding that specific stage of the project. If something goes wrong, I cannot just leave it until the next workday to fix it; otherwise I will fail, and the whole project will turn into an epic fail before it has even seriously started.
I also felt how managers can sometimes be a source of distraction or issues if we do not know how to deal with them… Take that VP who thought his emails were getting lost: all it took to calm his nerves was a small, quick test proving that his emails were not lost. Then he realized that, OK, this was just a project problem that was being worked on… The same went for the other complaints that they received…
There was something I felt was missing on the customer side: not everyone in their company was ready for the project… Yes, IT was ready for it, but other departments, management, and end users should also have been aware of what was going on, what work was being done, and what it meant! Of course, if we look at the time frame and the situation in which the project decision was taken, I can understand why that did not happen.
It is also important that we work with good, qualified people: people who understand what is going on and try to be effective, helpful, and resourceful in times of need and emergencies instead of being whiny, lazy, and scared. A good team will make the project a piece of art, but a bad team will just ruin the project, and the lives of everyone involved in it.