Saturday, August 20, 2005

Major disruption in internet service at LANL on Friday

From Anonymous:

There was a major disruption in internet service at LANL on Friday. I'm told it was because of a worm attack on unprotected Windows systems. Can anybody supply details? Rich Marquez would like to make LANL an "All Windows, All the Time" operation. Friday is a glimpse of LANL's future, if Marquez is still around to implement his plan come December.

Rich Marquez does not impress me as a knowledgeable computer person at all, based on the current mess that is the enterprise project. People are finding it impossible to do formerly simple tasks, including ordering safety equipment.

I know a few folks who are buying safety equipment out of their own paycheck because the enterprise project has made it impossible to have LANL buy it. I wonder, if someone gets injured because of an inability to get safety equipment, will we finally take a better look at this situation?

Anyone who is going for "windows everywhere" has their head in the sand. The irony here is that while Oracle is making a huge Linux/Unix push, LANL is wiring up the Oracle-based enterprise project so that it works only with Windows! What decade are these people living in? "Windows everywhere" is thinking from the previous century!

Note I am NOT saying "linux everywhere". That would be a mistake too. It would be nice, however, to get somebody in charge who understands that we live in a multi-platform world.

I hope LANL is going to take a hard look at the enterprise project before it causes too much more damage.
8/20/2005 02:26:26 PM said:

"I know a few folks who are buying safety equipment out of their own paycheck because the enterprise project has made it impossible to have LANL buy it. I wonder, if someone gets injured because of an inability to get safety equipment, will we finally take a better look at this situation?...

I hope LANL is going to take a hard look at the enterprise project before it causes too much more damage."

It is against DOE regulations to use personal property for Lab business, even for safety reasons. As stupid as the requirement is, your friends could get fired for trying to protect themselves.
I have to comment on the enterprise project. This is truly a disaster for which they will probably give someone a distinguished performance award. I used to be able to walk a PA through for all the approvals in a day. Now the enterprise project and its associated operational efficiency program have expanded that to 2 weeks.

This and the fact that purchasing safety equipment went from being a pain to being impossible is making another multi-million dollar program on the business side of the lab a joke. Someone ought to be fired.
So what happened Friday that brought down many networks and the Enterprise apps for much of the workday? The comments here don't touch the original question, only bash Marquez, Windoze, etc.

I heard that many Windows 2000-based servers were hit by a sudden virus, which came through Thursday night, and that it wasn't something that Microsoft warned about. I would be surprised, no not really I guess, if there wasn't more coming on this.

Take a look at your Lab newspaper Friday, one of the awards was given to the team which does quick response to computer emergencies such as this. I guess this one slipped under their radar. I wonder if DOE also got hammered? Why are servers running W2K still?
The attack started about 8am on Friday and overloaded the network, causing problems (overloaded processing) in the routers. I've heard that it only affected Win2K systems. XP systems were not affected (at least this time). By noon the problem was resolved and the network was back up, due mostly to dedicated people in CCN working under stressful conditions. How the virus got into the Lab is the big question and will take a while to determine. Right now it's too easy to bring an infected laptop inside the Lab and connect it to the internal network. Until a strong IT policy against such connections can be deployed, LANL will continue to see problems like this whenever a new worm or virus gets out. Several well-recognized companies (CNN for example) were also infected by this.
There has been a strong push at LANL to centralize ALL computing support. Some divisions have had "shadow" organizations to provide computer support and these are slowly disappearing.

The push for centralized computer support has always come from some empire builders in the computing division and got very strong support from Nanos.

Indeed, we have been our own worst enemy by allowing PC/WINDOWS, MACINTOYS, and LINUX systems often at the sole discretion of individual users. And, as was noted in a previous comment, it is not possible to prevent an unauthorized laptop from being plugged into our networks.

But centralized computer support is very difficult to implement and manage. LANL has a diversity of NW and other programs and one size will not fit all. (Nanos's solution was to eliminate everything but NW.) I have some experience with peripheral controllers for which WINDOWS/XP drivers are not available, and thus we had to run WINDOWS 2000 on these systems and NOT connect them to the network.

I don't have a solution here except to suggest that Group level managers take an active role in assuring that the members of their group are behaving responsibly. In our Group, the GL allowed anarchy in computing because he was unwilling to make anybody unhappy.

By the way, what is coming next is centralization of the engineering function at LANL. This is being strongly pushed by people from FM and PS Division which, as we know, are the repositories of the chronically unemployable!
The outage on Friday was related to the Zotob worm. It got in the way all worms get into firewalled networks: by intrepid users bringing their laptops to work from home.

Using the CCN-2 [1] managed desktop services, this vulnerability was patched on Friday the week before. If you received the SMS push of the patch, you were not affected. The problem occurred due to the following:

* Systems were not patched. [2]

* The threat was underestimated by the CCN-5 security team. [3]

* The routers the lab uses can be found on aisle 23 at Wal-Mart. Make your own interpretations on quality.

There needs to be strong lab policy mandating good systems administration with performance and fiscal penalties if these are not adhered to. Personally, I'm hoping Lockheed Martin will bring this. It's clear that the UC management has no interest.

[1] CCN-2 should be interpreted as CCN-DC (departmental computing). This organization was recently reorged; it made management happy.

[2] CCN-2 did a good job of managing their systems. Outside organizations did not do this in a timely manner. Generally CCN-2 does a good job of this.

[3] The vulnerabilities needed null-sessions, or some other method of authentication.
The poster at 8:21 am has no idea what he is talking about concerning his comments on the routers. Why make such a ludicrous comment? Does s/he even know what kind of routers LANL uses? I suspect not.
Kind of stupid to only have a firewall at the head end connection to the internet. Would make sense to have distributed hardware firewalls at various and numerous segments of the LANL network. At least this way if a laptop user connects an infected laptop to the yellow you can isolate the threat and not have it run rampant all across the laboratory. It CAN and SHOULD be done. Forget all the other hoo-ha that's being done. Better yet, our Mac OS X servers hummed right along through all this without even a blip. Windows sucks.
This post has been removed by a blog administrator.
Anyone who remembers me from the Network Managers Users Group (NMUG) is aware that I know nothing about systems management, so I am not afraid to ask this question:

Could the worm have gotten into the network if every computer connected to it had a firewall?

Larry Creamer, DX-1 Retired
Good question. The answer is hopefully "no" but until CCN determines how it got in we won't know for sure.

Of course the other problem is how to ensure that all computers have firewalls. CCN is trying to solve that problem but it's not easy.
Regarding firewalls, if you run something like BLACKICE, then CCN cannot scan your computer.
Anonymous 8/22/2005 06:37:03 AM said:
"Regarding firewalls, if you run something like BLACKICE, then CCN cannot scan your computer."

I've often wondered what all the scanning has accomplished. Maybe it is time to trust people and protect the system.

Larry Creamer, DX-1 Retired
The scanning had some very positive results in that it found those idiots who were NOT running antivirus programs, etc.

However, very often the infected computer was disabled from connecting to the network at the router and neither the user nor the respective group management was notified. So, very often, the user spent days trying to find the problem with the computer. It is this kind of arrogant, inconsiderate behavior on the part of CCN that causes many of us to oppose centralized management of the computer system.
Surprised that this did not make the national media as well as CNN/Europe.
The reason CCN scans is a finding from DOE that told us we had to in order to detect vulnerabilities. Because these vulnerabilities (like last Friday's) are so pervasive at times, it is difficult to control them when you have 15000 systems on the network. The only way to be partly successful is to automatically block vulnerable systems that are detected. Of course once you block the system there's no way to notify the user of that system. What folks are supposed to do when their network is not working is to call LANL's Network Operation Center. There, network personnel can tell you if you have a real network problem or if you are blocked because your system was vulnerable. No one should spend time trying to fix their own network problems without consulting the NOC first.

I realize this is all onerous to most LANL users but it's a different environment today than it was say 10 years ago. Friday's episode and the few in the past year or two show why this is so. I wish every user was knowledgeable enough to protect their own system but they aren't. And saying we should all move to Macs or Linux is not a reasonable answer for LANL at this time.
I do not accept that there is "no way to notify the user." CCN has the IP address. If registered, that should have the user's name, Z#, phone number, etc.

If not registered, the IP address gives the subnet, which narrows it down. A simple email to division.all would provide notification.
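The lookup described here could be sketched roughly as follows. This is only an illustration, not CCN's actual system: the registry contents, the subnet-to-mailing-list table, the addresses, and the function name `notification_target` are all made up for the example.

```python
import ipaddress

# Hypothetical registration records keyed by IP address (all values invented).
REGISTRY = {
    "10.1.2.15": {"name": "J. Doe", "z_number": "Z000000", "phone": "555-0100"},
}

# Hypothetical mapping from subnet to a division-wide mailing list.
SUBNET_LISTS = {
    ipaddress.ip_network("10.1.2.0/24"): "division-a.all@example.org",
    ipaddress.ip_network("10.1.3.0/24"): "division-b.all@example.org",
}


def notification_target(ip):
    """Return (kind, contact) describing whom to notify about a blocked IP."""
    # Registered address: notify the named user directly.
    record = REGISTRY.get(ip)
    if record is not None:
        return ("user", record)
    # Unregistered address: fall back to the mailing list for its subnet.
    addr = ipaddress.ip_address(ip)
    for network, mailing_list in SUBNET_LISTS.items():
        if addr in network:
            return ("division", mailing_list)
    # Neither registered nor in a known subnet.
    return ("unknown", None)
```

A registered address resolves to a person, an unregistered one falls back to the mailing list for its subnet, and anything else is reported as unknown so that someone can follow up by hand.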
"And saying we should all move to Macs or Linux is not a reasonable answer for LANL at this time."

I was with you until this last bit, 8/22/2005 07:52:05 PM. Linux and Mac OSX are the two platforms that are the most secure, resistant, productive alternatives to Windows that exist. To say that they are not acceptable solutions to LANL is to say that LANL has an IT operation that screams for new leadership.

Apple has always been focused on proprietary models, and jumps from product line to product line, always looking for the lock-in, high-profit winner. Newton was a failure, but iPod is a success. Jobs has always played the brinksmanship game, which doesn’t make executive-level IT pros very comfortable. Apple isn’t really interested in the enterprise market. For example, there are no enterprise-class servers bearing the Apple logo. Apple's total gross revenues, for both hardware and software, are a tiny fraction of Microsoft's. This picture won’t change anytime soon.

Linux depends heavily on the open source world, and most IT executives are finally acknowledging what most real coders knew all along; just because the source is open doesn't mean that anyone with real expertise has actually looked at the code. SourceForge is flooded with 0.x implementations that died on the vine because the two-guys-in-a-garage model depends on the two guys being geniuses. By definition, geniuses are rare. Face it: most enterprise Linux implementations are supporting an Apache server. This is not exactly a broad-spectrum business model. It will take many years for the emergence of a truly viable alternative to Windows.

Even if we disregard all of the above, a switch off of Windows just won’t happen at LANL. UC doesn't count, but Bechtel is a Windows shop. LockMart is a Windows shop. So any "new" IT leadership won't be interested in doing anything else, no matter how "secure, resistant, and productive" your alternatives are.

You make some good points, Kirilan, especially the ones about Bechtel and LM being Windows shops. The fact is, though, that IT is changing, and enterprise is increasingly moving to non-MS solutions because of cost and security reasons. I won't cite examples; if you are interested you can read about enterprises developing new non-MS enterprise deployments daily in publications like and Slashdot.

That argument that "UC/Bechtel and Lockheed Martin do it that way" is not a good argument to continue "doing it that way". The rest of the world is changing.
Do not be naive! MacIntrash computers are not more secure. It's just that with only about 5% of the market, the payoff for writing a virus or worm is just too small.
A common (if foolish) misconception. The BSD Unix kernel of Mac OSX is inherently far more secure than Windows. Likewise Linux, Solaris, AIX, HP-UX, or any other Unix.
Most of the viruses/worms have gotten into the lab because of untrain(ed/able) users. Most of the virus infections have come from the same people each time.
1) User brings in laptop from home without running antivirus/firewall.
2) User uses outside email and downloads the virus by circumventing the LANL firewall.
3) User/sysadmin does not follow LANL cookbooks because CCN wrote them.. thus null sessions and other things 'prohibited' by them are left on.
4) Group sysadmins turn off scanning from CCN-5 because it impacted their work. This means they scan ok, but get hammered by worm.
5) People get a slap on the hand at most for repeat violations. Having had to clean one system 5 times because the user is too important to say no to is not uncommon.
6) LANL Cybersecurity has no teeth, and leaves the enforcement to CCN-5 to get boxes off of the network. This is yet another LANL passive aggressive arrangement where people are set up against each other versus trying to work together.

Personally, I am glad I don't have to police the place any more.. you get no credit, no overtime, and people get smeared for any problems with throughput or downtime.
The cure for this problem is to embarrass and shame those whose careless
attitudes harm the rest of us on the network. How about seeing some names
of these modern day "Typhoid Marys", be they low level support people or
even ADs at the top?

Every Windows PC at LANL on the network should be running three programs:
(1) Anti-virus software, updated frequently, (2) a software firewall,
like Zone Alarm, and (3) Anti-spyware software. You can currently get a
very nice Anti-spyware product from Microsoft for free. Without these
three protections, it is only a matter of time before your system is
compromised if you run Windows. Of course, if the Lab pushed Mac OS X
as a desktop solution, none of these discussions would even be necessary.

It should be clear from the latest hacks on supposedly "secure" WinXP SP2
that things will only get worse for Windows users. You'll be needing all
the CPU cycles you can get in the future just to protect your system from
increasingly vicious attacks on this particular OS.
I would love to see a Typhoid Mary list.. I know I would have been on it at least once for forgetting about a system that was under my name but should have been on another. I was negligent in reassigning it to the student and should have gotten the black eye.

The problem is that CCN-DC has solutions to all these problems but various departments do not use them because of 10-20 year grudges with people who retired a long time ago.

Norton Antivirus SAV-10 has spyware protection

There is a corporate desktop firewall solution that is ready for deployment.

There is a patch management system with SMS that works in most cases.. but like any software solution can go wrong every now and then. People seem to expect computers to be perfect but forget that they are controlled by the most fallible unit.

Anonymous because when I put my name down, I get phone calls from people who called me a nanite for being neutral on him.
I am curious how many of these bloggers actually work in the computer security fields. Let's look at the facts.
#1 Mac OS X is by far the least attacked system on the internet. However, least attacked does not mean most secure. It means that no one cares about attacking a 5% market share, especially when the majority of that 5% is home users (not pay dirt). Hackers want to get the big dogs, and the big dogs just don't use Macs.

#2 Linux/Unix/??IX is not AN operating system; it is a whole world of different systems. To blanket them all in the same security model is ignorant.

#3 While there are more attempted attacks against LANL machines on the Windows platform, there are almost 2X as many successful attacks against Linux/Unix machines. The reason is threefold. First, it is far more difficult to secure a Linux machine, and few people have the skill to do it. Second, most of our Unix/Linux machines are administered by local organizations who don't follow the models set up by CCN, models based on industry standards. Third, security updates are not provided on a regular basis.

#4 There has not been a single successful malware (virus, worm, trojan, etc.) attack in over 5 years against the Windows platform when the machine has been up to date on its security patches. The anti-Windows people like to call Windows insecure because Microsoft regularly puts out patches for yet another vulnerability, but the fact is this is what makes Windows so secure. They actually fix their problems. Mac has finally started to put out service packs (they may call it Mac OS X Tiger, or Panther, etc., but the reality is you're getting OS X service pack 2, call it what you choose).

Those things said, let me say that I do not want to see a Windows-only world. There are times and places where other systems are simply better suited for the needs of the Lab than Windows, but for the typical desktop user, Windows is the way to go.
Let's look at what actually happened Friday, and why it happened.

There were a few reasons for our network outage.

First, CCN runs a program called Tipping Point, which runs constantly on the LANL network. Its job is to analyze network traffic and, when it detects an infected computer, to block that computer from the network, effectively stopping the spread of the virus/worm. Someone at CCN was misinformed that this particular worm was able to spoof its IP address (which would cause Tipping Point to block the wrong person), so to prevent blocking the wrong person they chose to turn off the program. This allowed the worm to spread.

#2 Because the Lab is still working on bringing everyone into the central administration of CCN, there were some systems which failed to meet the CCN requirements for security. No CCN managed machines were infected unless the user had bypassed CCN safeguards placed on the machine. The worm only infected Win 2000 and Win 2000 server. Many organizations who have chosen to maintain (or attempt to maintain) their own computer systems had failed to keep their computers updated with security patches. This is another good reason for centralized management of computer resources. (Everyone may now cringe because I said such an evil thing. I realize that there are downsides to central control, but it is worth the trade-off.)

#3 Someone with too little knowledge and too much authority decided that it would be a good idea to use the central router to monitor the traffic of this worm. (By the way, Wal-Mart doesn't sell routers that cost anywhere near the many thousands that this one does.) That was stupid to say the least. The router was instantly bombarded with far more information than its processor could handle, and it crashed. (And no, it isn't running Windows.) This is when everyone's network stopped working.

#5 CCN had a lessons learned meeting to discuss the problem and the solution. All blocked computers were unblocked due to unreliable blocking during the router failure. The router, after an hour of rebooting and reconfiguring, was brought back online, and Tipping Point was turned back on.

Finally, by about 2:00PM the worm had been contained, all infected computers had been blocked from the network, dozens of sys-admins were pulling their hair out trying to get their servers up again because they didn't dare let CCN keep them safe, and the majority of LANL got to work again.

Findings. CCN's system worked beautifully. It contained the worm, immunized 98% of LANL systems before it ever arrived and would have kept the network running without a hitch. However, human beings messed it up. CCN is partly to blame; they had some people make some dumb mistakes. Other organizations who refuse to use CCN's tools are also partly to blame; there wouldn't have been an infection otherwise. Mostly, some idiot with a laptop is to blame because he or she brought this whole thing upon us.
Let me see if I understand this disruption. The Lab firewall protected against the worm. Properly patched operating systems, on individual computers connected to the network, protected against the worm. Up-to-date virus protection systems, operating on those computers, protected against the worm.

The worm wriggled in because an unprotected laptop computer containing the worm was connected to the network inside the firewall. If all computers inside the firewall were properly protected, nothing would have happened.

The answer to your problem is simple. All you have to do is to ensure that every computer that is connected to the network behind the firewall is completely protected. Sounds simple, doesn't it? Just be prepared to handle the breach as well as it was handled last week. It will happen again.

Larry Creamer, DX-1 Retired
Some clarification to the poster at 10:40:13am. CCN has several systems (not programs) to protect the network. Tipping Point is a network appliance described as an intrusion prevention system (IPS) that attempts to block viruses, worms, etc., from entering the network if they get past the firewall. Tipping Point does not block individual computers. That is done by a totally separate system and, yes, it uses information from the routers to identify where a vulnerable or infected system is and, once detected, will send a command to the edge switch to turn off that particular port. There were lessons learned from the Friday incident, including addressing the impact of the router monitoring techniques. CCN will make these changes to better manage such attacks the next time there is one. And, yes, there will be a next time because of people bringing in infected systems and connecting them to the network.
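As a rough illustration of the separate blocking system described above (identify a suspect host from router data, then turn off its edge-switch port), here is a minimal sketch. Everything in it is an assumption for the example: the flow-record format, the threshold of 50 distinct destinations as a sign of worm-like scanning, the port map, and the function names. It is not a description of how CCN's system actually works.

```python
from collections import defaultdict

# Assumed threshold: a host contacting this many distinct destinations is treated as scanning.
SCAN_THRESHOLD = 50


def find_suspect_hosts(flows, threshold=SCAN_THRESHOLD):
    """flows is an iterable of (source_ip, destination_ip) pairs taken from router data."""
    destinations = defaultdict(set)
    for src, dst in flows:
        destinations[src].add(dst)
    # A source that reaches many distinct destinations looks like a scanning worm.
    return sorted(src for src, dsts in destinations.items() if len(dsts) >= threshold)


def ports_to_disable(suspects, port_map):
    """Look up the (switch, port) pair for each suspect host that has a known location."""
    return [port_map[ip] for ip in suspects if ip in port_map]


# Hypothetical data: one host sweeping 60 addresses, one host behaving normally.
flows = [("10.0.0.5", "10.0.1.%d" % i) for i in range(60)]
flows += [("10.0.0.9", "10.0.1.1"), ("10.0.0.9", "10.0.1.2"), ("10.0.0.9", "10.0.1.3")]
port_map = {"10.0.0.5": ("switch-7", 12), "10.0.0.9": ("switch-7", 13)}

suspects = find_suspect_hosts(flows)
print(ports_to_disable(suspects, port_map))
```

Keeping detection separate from the port lookup mirrors the split described above: one part decides which hosts look infected, the other decides which switch port to shut off.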