Weblog entry #2 for naoliv
We are having a strange problem where I work, and I don't know how to debug it.
Our central switch is a stacked 3Com 5500G-EI SFP + 5500G-EI. Leaf switches are 3Com 2948-SFP, connected to the 5500G via optical fiber.
Connected to one of those 2948s, there are more 3Com Baseline switches.
It's more or less this (http://people.debian.org/~naoliv/misc/network.png):

The network is "big" (about 300 machines). (Yes, I know: "break up this network", "create subnets", etc. If everything goes well, we will have a better network structure someday.)
But what has been happening these days is that the network stops working. We work at [1], and we can't ping the other machines connected to the same switch at [1] (nor can we reach the other locations). The same happens for people located elsewhere, like at [2].
It seems to be something that spreads across the entire network, but I have no idea what it could be.
Some days it lasts only 10 seconds, then everything gets back to normal. Today we were without network for almost 1 hour. The strangest thing is that it seems to stop around 5:00 PM.
Do you have any idea what could cause something like this? A worm, somebody running a malicious program, a faulty network cable, a broken switch? What can we use to debug this, please?
Thank you very much! Edit: See comment #5 for more info, please.
Comments on this Entry
[ Parent | Reply to this comment ]
Maybe we have found a loop today, but I was unable to verify it.
But the strange thing is that today, at 5:00 PM again, the network stopped. I was able to capture the traffic on my machine (almost 70 MB in 13 seconds).
What I saw is a lot (really a lot) of this: a Windows machine, at [2], sending
"Local master Announcement MACHINE-NAME, Workstation, Server, Print Queue Server, Nt Workstation, Potential Browser, Master Browser", with destination "NETBIOS- (03:00:00:00:00:01)"
and a Linux machine at [1] (the same switch where my machine is connected), sending
"Standard query PTR _pgpkey-hkp._tcp.local, "QM" question" to "224.0.0.251".
The 70 MB of data consists only of this: a Windows machine sending this and the Linux one "answering" (the string seems to come from avahi). (We have disabled avahi on this machine; we still need to look at the Windows one. But why is avahi going crazy like this? Why is Windows sending this amount of traffic?)
Does this still look like a loop?
After 15 to 20 seconds it stops and the network gets back to normal.
Thank you!
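For a capture like the one described above, it can help to count frames per source and destination MAC address and see who dominates. What follows is only a minimal sketch, assuming the capture was saved in the classic pcap format (not pcapng) on an Ethernet link; the function name top_talkers is made up for illustration.

```python
import struct
from collections import Counter


def mac(raw):
    """Format 6 raw bytes as a colon-separated MAC address."""
    return ":".join(f"{b:02x}" for b in raw)


def top_talkers(pcap_bytes):
    """Count Ethernet frames per (source MAC, destination MAC) pair.

    Assumes a classic pcap file whose link type is Ethernet.
    """
    magic = pcap_bytes[:4]
    if magic in (b"\xd4\xc3\xb2\xa1", b"\x4d\x3c\xb2\xa1"):
        endian = "<"  # little-endian file (microsecond or nanosecond variant)
    elif magic in (b"\xa1\xb2\xc3\xd4", b"\xa1\xb2\x3c\x4d"):
        endian = ">"  # big-endian file
    else:
        raise ValueError("not a classic pcap file")

    # The link type is the last field of the 24-byte global header.
    linktype = struct.unpack_from(endian + "I", pcap_bytes, 20)[0]
    if linktype != 1:
        raise ValueError("only Ethernet captures are handled")

    counts = Counter()
    offset = 24
    while offset + 16 <= len(pcap_bytes):
        # Per-packet header: timestamp (2 fields), captured length, original length.
        _sec, _frac, incl_len, _orig_len = struct.unpack_from(
            endian + "IIII", pcap_bytes, offset
        )
        offset += 16
        frame = pcap_bytes[offset:offset + incl_len]
        offset += incl_len
        if len(frame) >= 14:
            # Ethernet header: destination MAC first, then source MAC.
            counts[(mac(frame[6:12]), mac(frame[0:6]))] += 1
    return counts
```

Calling top_talkers(open("capture.pcap", "rb").read()).most_common(5) (the file name is just an example) lists the five busiest source/destination pairs. In a broadcast storm caused by a loop, the same few frames tend to show up an enormous number of times.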
[ Parent | Reply to this comment ]
Back when I worked at Altus (www.altustech.com), we used to sell something called "Statscout". It could dig out SNMP stats from managed switches and routers. I'd gather up some stats for a week or so and see which switch ports are really chatty around those times.
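Whatever tool collects the counters, finding the chatty ports comes down to comparing two samples of a per-port byte counter. Here is a rough sketch, assuming the samples are plain dictionaries of port name to a 32-bit octet counter (as SNMP interface counters commonly are); the function names and the data layout are invented for the example and are not how Statscout works.

```python
def counter_delta(prev, curr, bits=32):
    """Difference between two counter readings, allowing for one wrap-around."""
    return (curr - prev) % (1 << bits)


def busiest_ports(sample_a, sample_b, interval_s, top=3):
    """Return the `top` ports by average bits per second between two samples.

    sample_a and sample_b map a port name to its octet counter,
    read interval_s seconds apart.
    """
    rates = {}
    for port, prev in sample_a.items():
        if port in sample_b:
            octets = counter_delta(prev, sample_b[port])
            rates[port] = octets * 8 / interval_s
    return sorted(rates.items(), key=lambda kv: kv[1], reverse=True)[:top]
```

The modulo in counter_delta keeps the result correct when a counter rolls over between the two readings, which happens quickly on a busy gigabit port with 32-bit counters.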
[ Parent | Reply to this comment ]
Are you running some kind of spanning tree?
Because a network loop can cause such downtime. For example, what is the load on the switches during the downtime: do all the lights blink like crazy?
[ Parent | Reply to this comment ]
A network loop would be one of the things I'd check for first also. This kind of symptom is pretty common if you're not running some form of spanning tree and some idiot just plugs both ends of a network cable into two ports on the same switch, or into different switches in the same broadcast domain...
Cheers.:wq
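If you can write down which switch is cabled to which, a loop is simply a cycle in that graph, and it can be found on paper before anyone touches the hardware. A small sketch using union-find, assuming the cabling is given as a list of (switch, switch) pairs; the switch names in the usage note are hypothetical.

```python
def find_redundant_links(links):
    """Return the links that close a cycle in an undirected topology.

    links is a list of (switch_a, switch_b) pairs, one per cable.
    A cable with both ends on the same switch is reported as well.
    """
    parent = {}

    def find(node):
        # Find the representative of node's group, compressing the path.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    redundant = []
    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            # Both ends are already connected some other way: this cable forms a loop.
            redundant.append((a, b))
        else:
            parent[root_a] = root_b
    return redundant
```

For example, find_redundant_links([("core", "leaf1"), ("core", "leaf2"), ("leaf2", "base1"), ("base1", "leaf2")]) reports the second cable between leaf2 and base1. Note that deliberately aggregated links between two switches would also be flagged, so the output is a list of things to check rather than a verdict.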
[ Parent | Reply to this comment ]
After disabling avahi (probably unrelated, but well) on the Linux machine and removing 3 loops that we found on the network, the problem seems to be solved.
Thank you all for your help and attention!