I am putting this info in the forums in the hope that it will help someone who runs into the same issue.
Here is the network environment that was set up: a free, open wireless network at a not-for-profit hospital. The hospital has more than 450 beds and loads of people going in and out. It gives free internet access to anyone in the building so that patients, families, and even doctors and nurses have a way to communicate. Access is over wireless with no password.
When this first started in 2007 it was simple: the hospital took the traffic and just sent it to the ISP. There were no issues until the hospital started getting letters from most of the TV and motion picture companies telling us to stop downloading copyrighted material. At that point our lawyers jumped in and told us we needed a captive portal page to keep the hospital from being sued for things that the hospital as a business was not doing.
M0n0wall came to the rescue. In 2007, with still only a small number of people using the system, the IT group took an old P3 system with two NICs and put in a captive portal that made the lawyers happy.
In 2008 the web started getting really slow on this free internet. The IT team started looking around and saw that a handful of people were using 99% of the bandwidth. The IT team assumed these were the same people whose bittorrenting had made us put in the captive portal. M0n0wall came through again by letting us set limits on each user going through the captive portal.
In late 2008 we changed out the m0n0wall hardware from the P3 to a P4 2.0 GHz system. In 2009 we started having major issues with m0n0wall: people would not get the page to accept the agreement and could not get internet access. We finally traced this back to the DHCP server giving a host an IP address that another computer had already used to hit the accept button. The fix was to kick that user from the captive portal, and then all was fine. It was also around this time that we saw so many phones that we moved from 1 class C network to 8 class C networks. Since most phones never clicked through the accept page, this was purely a need for more DHCP IPs. About this time m0n0wall also started to fail when showing diag_dhcp_leases.php.
In 2010 things started to get slow, and we upgraded our hardware to a P4 2.8 GHz system with a load of RAM. This helped some; it was OK but not what we had hoped for. By mid 2011 the CPU on the m0n0wall box was constantly at 100%. The IT team again looked at the people who were bittorrenting as the cause. After several months of finding and kicking the people who were using bittorrent, we realized this was not helping much. In late 2011 the IT team started looking into why the CPU load was so high. What we finally realized was that we were using 100% of the CPU but top did not show us where it was going. The m0n0wall server was handing out an unknown number of IPs, since diag_dhcp_leases.php would not return, and had about 850 systems that had clicked the accept button on the captive portal. After reading the m0n0wall forums and docs, the team finally realized that the PCI bus on the P4 was overloaded. We decided to replace the old P4 with an older Xeon-based server that had a PCI-X bus and Intel cards. This was all hardware that had been decommissioned from another project. In early February 2012 we installed this server, and m0n0wall did not seem to do much better.
At this point we called in a VAR to talk to us about a Cisco way to handle the captive portal and also limit bandwidth use. It was during this talk with the VAR that the IT team came up with an idea that would keep m0n0wall in use: run more than one m0n0wall server. We had not looked at this option before because we wanted users to be able to move from somewhere like the ER to a normal room without having to mess around with changing network settings. This solution works only because the wireless controllers have to have the DHCP server listed for the network. The decision was made to move to a larger 64 class C network and have the first m0n0wall server hand out IPs from class C's 1-10 and the second m0n0wall server hand out IPs from class C's 11-20. Since the netmask covers the whole location, someone could get an IP in the ER from the first m0n0wall server and move to a room without losing connection. If they released and renewed their IP address while in the room, they would get an IP address from the second m0n0wall server.
Here is an example IP setup of what I am talking about.
The full network would be 172.16.0.x - 172.16.127.254
M0n0wall 1 would be 172.16.0.1 and would hand out IPs 172.16.0.2-172.16.9.254
M0n0wall 2 would be 172.16.10.1 and would hand out IPs 172.16.10.2-172.16.19.254
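For anyone who wants to sanity-check a split like this before rolling it out, here is a rough Python sketch using the standard ipaddress module. The /17 prefix is an assumption taken from the example range above (which is wider than 64 class C's), and the box names and function names are made up for illustration; none of this is m0n0wall configuration.

```python
import ipaddress
from itertools import combinations

# Assumption: one flat subnet matching the example range
# 172.16.0.x - 172.16.127.254, i.e. 172.16.0.0/17.
SITE = ipaddress.ip_network("172.16.0.0/17")

# Hypothetical labels -> (gateway, first pool address, last pool address),
# with the addresses copied from the example above.
BOXES = {
    "m0n0wall-1": ("172.16.0.1", "172.16.0.2", "172.16.9.254"),
    "m0n0wall-2": ("172.16.10.1", "172.16.10.2", "172.16.19.254"),
}


def to_int(addr):
    """Convert a dotted-quad string to its integer value."""
    return int(ipaddress.ip_address(addr))


def pool_size(first, last):
    """Number of addresses in the inclusive range first..last."""
    return to_int(last) - to_int(first) + 1


def pools_overlap(a, b):
    """True if two inclusive (first, last) ranges share any address."""
    return to_int(a[0]) <= to_int(b[1]) and to_int(b[0]) <= to_int(a[1])


def check_plan(site, boxes):
    """Return a list of problems found in the plan (empty if none)."""
    problems = []
    for name, (gateway, first, last) in boxes.items():
        # Every address must fall inside the shared subnet so that
        # clients keep working when they roam between areas.
        for addr in (gateway, first, last):
            if ipaddress.ip_address(addr) not in site:
                problems.append(f"{name}: {addr} is outside {site}")
        # A box must not hand out its own address.
        if to_int(first) <= to_int(gateway) <= to_int(last):
            problems.append(f"{name}: gateway {gateway} is inside its pool")
    # No two boxes may hand out the same address.
    for (n1, b1), (n2, b2) in combinations(boxes.items(), 2):
        if pools_overlap(b1[1:], b2[1:]):
            problems.append(f"{n1} and {n2}: pools overlap")
    return problems


if __name__ == "__main__":
    for name, (_, first, last) in BOXES.items():
        print(name, first, "-", last, pool_size(first, last), "addresses")
    print(check_plan(SITE, BOXES) or "plan looks consistent")
```

With the ranges from the example, each pool works out to 2557 addresses and the check reports no problems. Adding another box to this sketch is just one more entry in BOXES with a pool that does not overlap the first two.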
This only worked because the controllers allow us to specify the DHCP server. We have seen a great improvement in speed, but the funny thing is that we are going to roll out a third m0n0wall server at some point.
When looking at the forums I saw many people with similar issues of running out of DHCP leases, and some who I now think might have been killing their bus. This info is here for you, and I hope you can find a way to make things work well.