Topic: IPv6 works but IPv4 doesn't  (Read 5548 times)
« on: November 22, 2010, 23:13:45 »
Hans Maulwurf **
Posts: 56

So I've been running this version for some time now (generic-pc) and have some interesting problems, which are not really related to IPv6:
My setup is a PPPoE connection for WAN and two NICs for LANs (LAN and OPT1) plus a hacked-in OPT2 (via downloading, modifying and re-uploading the config file) which is basically the WAN NIC. I need this to assign an IP address to it, so I can connect to my modem's web interface and read some status values from it.
Now at least once in 24 hours the ISP will drop the connection so m0n0wall will redial immediately.
With this build, every 2 or 3 days when the WAN connection is dropped again and m0n0wall re-establishes the connection, for no reason, all interfaces are almost dead. This means I cannot connect to any websites anymore from any LAN computer, not even to m0n0wall's web interface. Pinging m0n0wall still works, DHCP also works, but with a delay of about 10 seconds for each reply. DNS does not work.
I'm relaying syslog to another machine on the LAN and I can see that there are still things going on; it even successfully establishes the WAN connection again.
What is even more strange is that sometimes in this state, the sixxs tunnel still works perfectly fine (no DNS of course, but I can connect to IPv6-enabled hosts if I know their address, e.g. the httpd on my dedicated server).
I also ran top on the terminal to see if any process would lock up the CPU, but I can see nothing unusual there.
All I can say is that it happens when the PPPoE connection gets terminated, but I don't know if it's caused by the termination or by re-establishing it. It might as well be that the connection loss is a result of m0n0wall going into this weird state.

Any thoughts on that? 1.32 worked like a charm.
« Reply #1 on: November 23, 2010, 01:03:48 »
brushedmoss ****
Posts: 446

Hi Hans,  sounds like a NAT issue if your ipv4 is broken.

I suspect your NAT is confused due to your config for modem access. Can you 'undo' it and see, or check your NAT status after a reconnect?

Why this worked in 1.32 I'm not sure.

There is a newer image you can test here, but it just has a fix for dhcp-pd, so I don't expect it to fix this:

http://m0n0.ch/temp/embedded-1.33-pre3.img
« Reply #2 on: January 08, 2011, 16:53:10 »
Hans Maulwurf **
Posts: 56

I found that this is really hard to debug.
Could this even be related to hardware issues?

Because after posting this, it worked for almost two weeks before I got that behaviour again. And as said before, there is at least one reconnect including IP change per day. But if it happens, it always happens right after a reconnect/IP change. Could it be something like a race condition? If there are multiple scripts that get executed in parallel on a reconnect, maybe a specific order of commands they issue triggers it.
So, when it happened again after my posting from the 22nd of Nov, I removed OPT2 from my config (even made sure there is no occurrence of anything related to it left in the XML) and rebooted. That was around the 10th of December. And today, it happened again. IPv4 was dead, IPv6 still working. But it can't be just a NAT issue, since the web interface is also not reachable anymore, which should not require any NAT or routing.

Any idea how to debug this? I can put stuff into the image that I could run on the terminal, like I added top as an option there. Also, syslog still sends messages over the network, so sending packets seems to be ok.

The problem is that you just might think you fixed it, but then 4 weeks later you get the same issue again. So 1) it might take very long to get the symptoms again and 2) if it happens, there is not much you can do, since the network is almost dead.
« Reply #3 on: January 08, 2011, 23:21:43 »
brushedmoss ****
Posts: 446

I have a cable modem at home and did the

Quote
hacked-in OPT2

and had this problem intermittently in 1.32, weeks or days apart, and not on every release/renew but directly on some.

For me, IPv6 was not affected, and neither was DNS, as I was using the built-in forwarder etc. The webgui worked for me too, as these weren't affected by NAT in my setup.

I undid the NAT setup so I could access the cablemodem, and haven't had a problem since (months).

Next time it happens, check if you can access the webui via ipv6 and look at your nat status with /status.php

The only change I can think of from 1.32 to 1.33 is the NIC driver change for some Realtek cards. What NIC are you using?
« Reply #4 on: February 15, 2011, 14:57:19 »
Hans Maulwurf **
Posts: 56

So I got the same issue again and thanks to IPv6 still working I could get a snapshot of the status.php

The problem is that the filter rules in the "ipfstat -nio" section are messed up.
Usually it looks like this:
Quote
...
@12 pass in quick on ng0 proto udp from any port = bootps to any port = bootpc
@13 block in log quick on vr0 from !192.168.0.0/24 to any
@14 block in log quick on fxp1 from !192.168.3.0/24 to any
@15 block in log quick on ng0 from 10.0.0.0/8 to any
@16 block in log quick on ng0 from 127.0.0.0/8 to any
@17 block in log quick on ng0 from 172.16.0.0/12 to any
@18 block in log quick on ng0 from 192.168.0.0/16 to any
@19 skip 1 in proto tcp from any to any flags S/FSRA
@20 block in log quick proto tcp from any to any

@21 block in log quick on vr0 all head 100
@22 block in log quick on ng0 all head 200
@23 block in log quick on fxp1 all head 300
...
But now I got this:
Quote
...
@48 block in quick on ng24 proto tcp/udp from any to any port = microsoft-ds
@49 block in quick on ng25 proto tcp/udp from any to any port = microsoft-ds
@50 block in quick on ng26 proto tcp/udp from any to any port = microsoft-ds
@51 block in log quick on fxp1 from !192.168.3.0/24 to any
@52 block in log quick on ng0 from 10.0.0.0/8 to any
@53 block in log quick on ng0 from 127.0.0.0/8 to any
@54 block in log quick on ng0 from 172.16.0.0/12 to any
@55 block in quick on ng27 proto tcp/udp from any to any port = microsoft-ds
@56 block in quick on ng28 proto tcp/udp from any to any port = microsoft-ds
@57 block in quick on ng29 proto tcp/udp from any to any port = microsoft-ds
@58 block in log quick on ng0 from 192.168.0.0/16 to any
@59 skip 1 in proto tcp from any to any flags S/FSRA
@60 block in quick on ng30 proto tcp/udp from any to any port = microsoft-ds
@61 block in quick on ng31 proto tcp/udp from any to any port = microsoft-ds
@62 block in quick on ng32 proto tcp/udp from any to any port = microsoft-ds
@63 block in log quick proto tcp from any to any
@64 block in log quick on vr0 all head 100
@65 pass in quick on ng1 proto tcp/udp from 192.168.9.128/27 to 192.168.0.1/32 port = domain keep state
@66 pass in quick on ng2 proto tcp/udp from 192.168.9.128/27 to 192.168.0.1/32 port = domain keep state
@67 block in log quick on ng0 all head 200
@68 block in log quick on fxp1 all head 300
...
The skip rule is not right before the "block all tcp" line, so obviously everything stops working. It looks like the PPTP interfaces get mixed in.
It's confirmed by lines like this one in the filter-log section:
Quote
Feb 11 22:40:23 m0n0wall ipmon[136]: 22:40:22.598961 vr0 @0:63 b 192.168.0.2,55552 -> 188.40.58.150,993 PR tcp len 20 48 -S IN

Tell me if you need the whole status.php output, I'll send it to you via PM.
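
For anyone who wants to check a saved listing for the same symptom, here is a minimal Python sketch that could be run on another machine against the "ipfstat -nio" section copied out of status.php. It is not part of m0n0wall; the function names are made up for illustration, and it assumes the rule that the skip is meant to jump over is exactly the "block in log quick proto tcp from any to any" line shown in the listings above.

```python
import re

# The rule that the "skip 1" rule is supposed to jump over (taken from the listings above).
GUARDED_RULE = "block in log quick proto tcp from any to any"

# Excerpts copied from the healthy and the broken listing in this post.
HEALTHY = """\
@18 block in log quick on ng0 from 192.168.0.0/16 to any
@19 skip 1 in proto tcp from any to any flags S/FSRA
@20 block in log quick proto tcp from any to any
"""

BROKEN = """\
@58 block in log quick on ng0 from 192.168.0.0/16 to any
@59 skip 1 in proto tcp from any to any flags S/FSRA
@60 block in quick on ng30 proto tcp/udp from any to any port = microsoft-ds
@61 block in quick on ng31 proto tcp/udp from any to any port = microsoft-ds
@62 block in quick on ng32 proto tcp/udp from any to any port = microsoft-ds
@63 block in log quick proto tcp from any to any
"""


def parse_rules(listing):
    """Return [(rule_number, rule_text)] for every '@N ...' line, in listing order."""
    rules = []
    for line in listing.splitlines():
        match = re.match(r"@(\d+)\s+(.+)", line.strip())
        if match:
            rules.append((int(match.group(1)), match.group(2).strip()))
    return rules


def misplaced_skips(listing, guarded=GUARDED_RULE):
    """Return the numbers of 'skip N' rules whose next N rules do not include `guarded`."""
    rules = parse_rules(listing)
    bad = []
    for index, (number, text) in enumerate(rules):
        match = re.match(r"skip\s+(\d+)\s", text)
        if not match:
            continue
        count = int(match.group(1))
        skipped = [rule_text for _, rule_text in rules[index + 1:index + 1 + count]]
        if guarded not in skipped:
            bad.append(number)
    return bad


if __name__ == "__main__":
    print("healthy:", misplaced_skips(HEALTHY))
    print("broken: ", misplaced_skips(BROKEN))
```

On the broken excerpt this flags rule @59: the skip jumps over @60 only, so a TCP SYN still reaches the block rule at @63, which matches the "@0:63 b" entry in the filter log.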
« Reply #5 on: February 16, 2011, 01:53:51 »
brushedmoss ****
Posts: 446

Looking at the code, I don't see how you ended up with that filter.

Yes, please PM the entire output of status.php or post the 'unparsed ipfilter rules' section.
« Reply #6 on: May 09, 2011, 13:55:42 »
Hans Maulwurf **
Posts: 56

As I had some spare time, I modified filter.inc myself and added a syslog message when certain functions are entered and left.
I also added a parameter to filter_configure() and modified every call to that function I could find to pass some ID, so I can see where the call is coming from. Unfortunately, I only looked for calls in the *.inc files in etc/inc, so I missed most of the automatically triggered calls. I'll see whether I can modify the other calls too.
What it reveals is that there are four calls to filter_configure() after the PPPoE connection is (re-)established, so my first guess would be that in rare cases (the last) two of them overlap because they might be triggered by two concurrently running scripts. So they both clear the table first and then fill it up with the rules. As far as I can tell, that would mean that two scripts execute the line fwrite($fd, $ipfrules); in filter_configure() concurrently, but even then I'm not sure if this would make ipf mix them up, or if it can handle such a situation...
Here's what I got in my log. No overlaps, everything works fine. I'll just wait for the next incident, which will hopefully show an overlap.
Quote
<30>May  9 13:36:43 mpd: [pppoe] IFACE: Up event
<11>May  9 13:36:44 php: ### ENTER filter_configure(undef)
<11>May  9 13:36:44 php: ### EXIT filter_configure() in SUCCESS
<11>May  9 13:36:47 php: ### ENTER filter_configure(undef)
<11>May  9 13:36:47 php: ### EXIT filter_configure() in SUCCESS
<190>May  9 13:36:50 sixxs-aiccu: Succesfully retrieved tunnel information for Txxxxx
<190>May  9 13:36:50 sixxs-aiccu: AICCU running as PID 1660
<11>May  9 13:36:51 php: ### ENTER filter_configure(int 1159)
<11>May  9 13:36:51 php: ### EXIT filter_configure() in SUCCESS
<11>May  9 13:36:52 php: ### ENTER filter_configure(undef)
<11>May  9 13:36:52 php: ### EXIT filter_configure() in SUCCESS
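
Since the log is relayed to another machine anyway, spotting an overlap can be automated there instead of reading the log by eye. The following is only a sketch in Python, assuming the ENTER/EXIT marker strings exactly as they appear in the log above; the function name and the second sample log are made up for illustration. Syslog timestamps only have one-second resolution and relayed messages may arrive out of order, so a hit is a hint rather than proof.

```python
# Marker strings written by the instrumented filter_configure() (see the log above).
ENTER_MARK = "### ENTER filter_configure("
EXIT_MARK = "### EXIT filter_configure("

# Two calls copied from the log above: every ENTER is closed before the next one starts.
CLEAN_LOG = """\
<11>May  9 13:36:44 php: ### ENTER filter_configure(undef)
<11>May  9 13:36:44 php: ### EXIT filter_configure() in SUCCESS
<11>May  9 13:36:47 php: ### ENTER filter_configure(undef)
<11>May  9 13:36:47 php: ### EXIT filter_configure() in SUCCESS
"""

# Hypothetical log (not real output) showing what an overlap would look like.
OVERLAP_LOG = """\
<11>May  9 13:36:51 php: ### ENTER filter_configure(int 1159)
<11>May  9 13:36:52 php: ### ENTER filter_configure(undef)
<11>May  9 13:36:52 php: ### EXIT filter_configure() in SUCCESS
<11>May  9 13:36:53 php: ### EXIT filter_configure() in SUCCESS
"""


def find_overlaps(log_text):
    """Return every ENTER line that shows up while an earlier call is still open."""
    open_calls = 0
    overlaps = []
    for line in log_text.splitlines():
        if ENTER_MARK in line:
            if open_calls > 0:
                overlaps.append(line.strip())
            open_calls += 1
        elif EXIT_MARK in line:
            # Never go below zero, e.g. if the log starts in the middle of a call.
            open_calls = max(open_calls - 1, 0)
    return overlaps


if __name__ == "__main__":
    print("clean log overlaps:  ", find_overlaps(CLEAN_LOG))
    print("overlap log overlaps:", find_overlaps(OVERLAP_LOG))
```

The counter does not know which process wrote which line; it only reports that a second call started before the first one logged its exit, which is all that is needed to confirm or rule out the race.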
« Reply #7 on: May 09, 2011, 14:27:58 »
Manuel Kasper
Administrator
*****
Posts: 364

Thanks for taking the time to investigate this - it's always nice when users are able to do their own debugging, especially with such sporadic issues! I also think it's possible that the firewall reconfiguration somehow ran in parallel and messed up the rule set, however that doesn't explain the 10 second DHCP delay that you've experienced - very odd.

As a little hint - instead of adding an identifier to every call, perhaps you could simply obtain information about the caller using debug_backtrace() (http://php.net/manual/en/function.debug-backtrace.php).
« Reply #8 on: May 09, 2011, 19:15:01 »
Hans Maulwurf **
Posts: 56

Ah, didn't really do anything in php before, debug_backtrace() seems really handy; I'm using that now.

I saw there's a function pair lock_file()/unlock_file().
If there really are overlapping calls the next time I get these symptoms, I will try whether using locking in filter_configure() solves the issue.
« Reply #9 on: May 09, 2011, 22:09:36 »
brushedmoss ****
Posts: 446

rc.newwanip has a lock file method, but I notice rc.newwanip6 (called by aiccu) doesn't, and both call filter_configure().

adding the lock to rc.newwanip6 might fix this

« Reply #10 on: May 10, 2011, 16:53:50 »
Hans Maulwurf **
Posts: 56

Yep, figured this out today thanks to the backtrace.

rc.newwanip calls interfaces_wan_configure6(), which creates the gifwatch.sh that executes rc.newwanip6 once the tunnel is up.

After creating gifwatch.sh and launching it, interfaces_wan_configure6() will call interfaces_rtadvd_configure() which will also call filter_configure() after starting rtadvd.
So if the tunnel comes up fast enough and/or setting up rtadvd takes long enough, they both run filter_configure() in parallel.


So I guess I'll add the same lock ("{$g['varrun_path']}/newwanip.lock") to rc.newwanip6 later, but I want to wait until this actually really breaks once again. For science.
« Reply #11 on: May 10, 2011, 18:13:15 »
brushedmoss ****
Posts: 446

try

Code:
http://svn.m0n0.ch/wall/branches/freebsd8/phpconf/rc.newwanip6

and

Code:
http://svn.m0n0.ch/wall/branches/freebsd8/phpconf/inc/filter.inc

Basically, rc.newwanip6 will only change IPv6 rules and uses its own lock.

usual caveats about breaking your system apply :-)
« Reply #12 on: May 11, 2011, 16:55:09 »
Hans Maulwurf **
Posts: 56

Quote
<30>May 11 16:07:34 mpd: [pppoe] IFACE: Up event
<11>May 11 16:07:35 php: ### Call(2): /etc/rc.newwanip:62 filter_configure()
<11>May 11 16:07:35 php: ### writing nat rules...
<11>May 11 16:07:35 php: ### DONE writing nat rules...
<11>May 11 16:07:35 php: ### writing ipv4 rules...
<11>May 11 16:07:36 php: ### DONE writing ipv4 rules...
<11>May 11 16:07:36 php: ### writing ipv6 rules...
<11>May 11 16:07:36 php: ### DONE writing ipv6 rules...
<11>May 11 16:07:36 php: ### Exit(2): filter_configure()
<11>May 11 16:07:38 php: ### Call(3): /etc/inc/vpn.inc:539 filter_configure() <= /etc/rc.newwanip:75 vpn_ipsec_configure(1)
<11>May 11 16:07:38 php: ### writing nat rules...
<11>May 11 16:07:38 php: ### DONE writing nat rules...
<11>May 11 16:07:38 php: ### writing ipv4 rules...
<11>May 11 16:07:38 php: ### DONE writing ipv4 rules...
<11>May 11 16:07:38 php: ### writing ipv6 rules...
<11>May 11 16:07:39 php: ### DONE writing ipv6 rules...
<11>May 11 16:07:39 php: ### Exit(3): filter_configure()
<190>May 11 16:07:40 sixxs-aiccu: Succesfully retrieved tunnel information for Txxxxx
<190>May 11 16:07:40 sixxs-aiccu: AICCU running as PID 11858
<11>May 11 16:07:41 php: ### Call(4): /etc/inc/interfaces.inc:1160 filter_configure(int 1159) <= /etc/inc/interfaces.inc:794 interfaces_rtadvd_configure() <= /etc/rc.newwanip:79 interfaces_wan_configure6(1)
<11>May 11 16:07:42 php: ### writing nat rules...
<11>May 11 16:07:42 php: ### DONE writing nat rules...
<11>May 11 16:07:42 php: ### Call(2): /etc/rc.newwanip6:52 filter_configure()
<11>May 11 16:07:42 php: ### writing nat rules...
<11>May 11 16:07:42 php: ### writing ipv4 rules...
<11>May 11 16:07:42 php: ### DONE writing nat rules...
<11>May 11 16:07:43 php: ### writing ipv4 rules...
<11>May 11 16:07:43 php: ### DONE writing ipv4 rules...
<11>May 11 16:07:43 php: ### writing ipv6 rules...
<11>May 11 16:07:43 php: ### DONE writing ipv6 rules...
<11>May 11 16:07:43 php: ### Exit(2): filter_configure()
<11>May 11 16:07:43 php: ### DONE writing ipv4 rules...
<11>May 11 16:07:43 php: ### writing ipv6 rules...
<11>May 11 16:07:43 php: ### DONE writing ipv6 rules...
<11>May 11 16:07:43 php: ### Exit(4): filter_configure()
Gotcha! Now it's time to fix it.


I'm now switching to the version posted by brushedmoss, with a lock in rc.newwanip6. I also changed interfaces_rtadvd_configure() in interfaces.inc to call the IPv6 version of filter_configure. Let's see if this issue is gone for good then.
Thinking about it a bit more, wouldn't it be best to add some locks to the filter_configure(6) function directly? What happens if the WAN ip gets renewed and at the same time I modify my IPv4 rules via webgui and hit "apply"? That's a million times less likely to happen than what I experienced here, but still possible in theory...
The big problem here is that it can lock you out of the webgui entirely, which really sucks if the box isn't sitting right next to you.
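
To illustrate what "locking inside the function" would buy, here is a small self-contained Python sketch of the idea. It is not m0n0wall code: reload_rules() is only a stand-in for filter_configure(), the lock path is made up, and it uses flock() from the fcntl module (Unix only) rather than m0n0wall's own lock_file()/unlock_file(), whose behaviour I haven't checked. Two callers run at the same time, and the lock forces each "flush, then load" sequence to finish before the other one starts.

```python
import fcntl
import os
import tempfile
import threading
import time
from contextlib import contextmanager

# Hypothetical lock file path, used only by this demo.
DEMO_LOCK = os.path.join(tempfile.gettempdir(), "filter_configure_demo.lock")


@contextmanager
def exclusive_lock(path):
    """Hold an exclusive advisory lock on `path` for the duration of the with-block."""
    with open(path, "w") as handle:
        fcntl.flock(handle, fcntl.LOCK_EX)  # blocks until nobody else holds the lock
        try:
            yield
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)


def reload_rules(caller, events, lock_path, work_seconds=0.05):
    """Stand-in for filter_configure(): flush and load happen as one critical section."""
    with exclusive_lock(lock_path):
        events.append((caller, "flush"))
        time.sleep(work_seconds)  # pretend that generating the rules takes a while
        events.append((caller, "load"))


def run_concurrently(lock_path=DEMO_LOCK, callers=("rc.newwanip", "rc.newwanip6")):
    """Start one thread per caller and return the recorded (caller, step) events."""
    events = []
    threads = [
        threading.Thread(target=reload_rules, args=(caller, events, lock_path))
        for caller in callers
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return events


if __name__ == "__main__":
    for caller, step in run_concurrently():
        print(caller, step)
```

With the lock inside the function, every caller is covered without having to remember a lock in each script, including a rule change applied from the webgui while a WAN IP renewal is in progress. The price is that a caller hanging inside the critical section blocks everyone else, so a real implementation would probably want a timeout.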
« Last Edit: May 11, 2011, 16:57:32 by Hans Maulwurf »
« Reply #13 on: May 11, 2011, 20:36:26 »
brushedmoss ****
Posts: 446

I was thinking it through more today. I'll post a better fix, probably putting the locking in the function, and then modify the other places that should only call filter_configure6(), like changing IPv6 rules etc...

You are of course welcome to submit a patch.
 