What a day! When I got into the office this morning I had a call from a client saying that their web site seemed to be down. I checked the site from my computer, which is on the same class C, and could see it without any problem. However, from beyond our router the site couldn’t be seen. We were in the process of moving this particular client’s sites to a new server so I assumed I must have hosed up some config files on the server. The server was running Red Hat 6.2 and had several IP addresses configured, from .80 to .85 on our C block. It turned out that packets going to or from those addresses could not pass from our network onto the Internet. I could ping the outside world from my workstation at .153 or from our name server at .2 but not from any of the IPs on the client’s machine, even though they’re all in the same class C. Now, the weird part was that I could ping the client’s machine from inside the class C.
This had me baffled for a while. What could suddenly make several IPs in the middle of a class C block stop working? Nothing in the configuration of the router had changed during the last 24 hours, but the logs clearly showed that those IPs ceased to be routed at about 2am last night. It had to be something on the server itself, I decided. I completely reconfigured the networking from scratch several times with no results. Then, just to make sure I wasn’t going insane, I tried configuring it with a completely different IP address. It worked! So it wasn’t the server at all; there really was a hole in our class C. I started checking logs on other servers and determined the problem was much more widespread than just the one server. It turned out that about 60 of our IPs had gone dead and quite a few of those were customer sites.
Then it hit me. It had to be the router on the upstream side of the T1 at Verio. Only Verio could screw something up this badly. Unfortunately, it was afternoon by now and all the local Verio people I knew were gone for the day. The only thing left was to call the Verio support center. This is not something I enjoy. Everyone hates having to talk to tech support but Verio’s level 1 makes most tech support people seem like geniuses. I believe Verio requires each applicant for a position in tech support to demonstrate complete lack of knowledge in a wide range of fields including the Internet, routers, TCP/IP, and protocols. At the same time, they have to demonstrate an absence of common sense and an inability to spell or remember proper names. Having English as a second language and the ability to speak way too quietly to hear are also valuable skills you’ll need to work in Verio level 1 tech support. But I digress…
I called, waited on hold, and eventually got someone on the phone. After a lengthy conversation in which I spelled my name and the company name several times and assured them that, yes, I really was one of their customers, I got to explain my problem. After just a few repetitions, he was ready to go to work on it. The conversation went something like this (but to be totally realistic, insert a line after each statement made by Verio in which I asked him to speak up so I could hear what the heck he was saying):
Verio: If you’ll give me your IP address, I’ll check and see if it’s working.
Me: I already know some of the IPs aren’t working, that’s why I called.
Verio: What’s your IP address?
Me: There is a range of IPs in one of our C blocks that seems to be dead, do you want one of the IPs that is working or one that isn’t?
Verio: I will ping your IP and see if it’s working, what is it?
Me: xxx.xxx.xxx.84 is one of the IPs that’s dead, is that what you want?
Verio: Hmmm… that IP doesn’t seem to be working.
Me: Yes, that’s why I called.
Verio: Do you have more than one IP?
Me: Yes, we have several C blocks, but only one is experiencing routing trouble.
Verio: What’s the IP address that has trouble?
Me: The one you just pinged is experiencing trouble, do you want another one?
Verio: How many IPs do you have?
Me: We have several C blocks, but only one is experiencing problems.
Verio: 1 C block? How many IP addresses are in that block?
Me: 255, the same as any C block.
Verio: What’s the range of IPs in your C block?
Me: All of the IPs in the C block are ours.
Verio: What are IP numbers in your C block?
Me: They range from 0-254
Verio: What’s the IP of your router?
Verio: Ok, I can ping your router, so the problem must be at your end, is there anything else I can do for you today?
At this point, I tried again from the top and explained that we had “many IP addresses” and some of them had suddenly gone dead. I explained that this was due to a routing problem in either my router or their router and that since I had already checked my router, theirs was the likely source of the problem. At this point, he asked for the password to our router so he could check and see if it was the source of the problem. I repeated my previous sentence and he seemed very confused but promised he would “look into it” and call me back, possibly by Monday. I explained that we had customers who were down, and the problem needed to be fixed immediately. He promised they’d look into it right away and gave me a case number but wasn’t sure when they would be able to call me back.
I hung up and played a round of Robotron in the game room to vent some frustration. While I was contemplating the problem, I suddenly realized what Verio had done. The exact number of IPs that had gone bad was most likely 64, and the reason had to be that someone at Verio had created a subnet out of our class C and assigned it to some other customer. It was about an hour later now and I called Verio back. This time I got someone who actually spoke English and talked like a regular human. I explained to her that I knew exactly what Verio had done and got her to type my theory into the notes on the case history. I explained again how many of our customer sites were down and how important it was that the problem was fixed tonight. She elevated the problem to level 2 tech support, who she promised would call me shortly.
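For anyone who wants to check the arithmetic behind that hunch, here is a rough sketch in Python. It is not what I ran at the time. The real block is masked in this post, so the code uses the documentation range 192.0.2.0/24 as a stand-in, and it assumes the carved-out subnet was an ordinary aligned 64-address block (a /26), which is a guess on my part. The `quarter_for` helper is made up for this illustration.

```python
import ipaddress

# Stand-in for our real class C, which is masked in the post.
# 192.0.2.0/24 is a documentation-only range, used purely for illustration.
BLOCK = ipaddress.ip_network("192.0.2.0/24")

# A /26 holds 64 addresses, so a /24 splits into exactly four of them.
# Assumption: the subnet taken out of our block was one of these quarters.
quarters = list(BLOCK.subnets(new_prefix=26))


def quarter_for(last_octet: int) -> ipaddress.IPv4Network:
    """Return the /26 quarter that contains the host with this last octet."""
    addr = BLOCK.network_address + last_octet
    return next(net for net in quarters if addr in net)


# Last octets mentioned in the post.
hosts = {
    "name server": 2,
    "client server, first IP": 80,
    "client server, last IP": 85,
    "my workstation": 153,
}

for label, octet in hosts.items():
    net = quarter_for(octet)
    print(f"{label:24} .{octet:<3} -> {net} ({net.num_addresses} addresses)")
```

Under those assumptions the client's addresses at .80 through .85 all land in the second quarter (.64 through .127), while the name server at .2 and my workstation at .153 land in different quarters. That fits what I was seeing: one contiguous chunk routed somewhere else, and everything on either side of it working normally.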
After about an hour I called back and asked what happened to the level 2 call. She checked the case and said they had assigned someone to work on it and had verified that my theory was correct – they had accidentally taken 64 of our IPs and assigned them to a DSL customer last night. After another hour I got a call saying they’d put in a static route as a temporary fix and would have a permanent fix by morning. Sure enough, we were up again. They said this was very unusual and probably wouldn’t happen again. Great, I’ll add that to the list of major screw-ups that probably won’t happen again. Like their mishandling of secondary DNS support for our domains, or the alleged fiber cut back in November, or the comedy of errors we went through getting the T1 installed in the first place (there’s enough material there to write a whole book), or the infamous Verio Focus Group experience. Well, enough ranting for today.