Weblogs for fugit
A Happy Friday post about fun one liners. Please post any one liners you have created that probably should have been a short shell script or any improvements for the below one liner. One improvement would be to get rid of the expr output.
Below was something I wrote just to quickly find if I had retired all the hosts that should have been retired from a txt file. I also used a slight variant of the code to check if DNS had been deleted.
UPDATE:
The reason for the expr was that I was having problems nesting the if statements inside the backticks. After mcortese's comment I updated the script to use an if statement outside of the backticks. RJC provided an alternative to backticks using the POSIX-compliant "$()". I have updated the script but did not nest the if statements, which would be easier with "$()".
One Liner
for x in `grep KILL list.txt | awk '{ print $6 }'` ; do add=`host $x | grep -v NXDOMAIN | awk '{ print $4 }'` ; if [ "$add" ]; then ping -c1 -W1 $add >/dev/null; expr 1 / $? 2>/dev/null || echo $x ; fi ; done
Example txt from my file below. The servers were all local so I was able to reduce the timeout. In order for the above code to work you will need to be able to lookup the hostname and ping the ip returned:
Text File
KILL 300 62 running 10.200.5.25 woody.example.com
KILL 504 65 running 10.200.5.26 potato.example.org
505 15 running 10.200.4.27 etch.example.net
KEEP 506 29 running 10.200.3.28 squeeze.example.co.uk
KILL 511 28 running 10.200.3.29 hamm.example.info
KILL 525 30 running 10.200.1.69 bo.example.xxx
KEEP 526 24 running 10.200.2.254 wheezy.example.tv
The Improved One Liner Based on Comments:
for x in $(awk '/KILL/{ print $6 }' /tmp/list.txt ) ; do res=1 ; add=$(host $x | awk '!/NXDOMAIN/{ print $4 }') ; if [ "$add" ]; then ping -c1 -W1 $add >/dev/null; res="$?"; fi; if [ "$res" -eq 0 ] ;then echo $x ; fi ; done
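Since this probably should have been a short shell script, here is a minimal sketch of what that might look like. The function names kill_hosts, still_alive and report_unretired are made up for illustration, and it assumes the column layout of the example file (status in column 1, hostname in column 6) plus a ping that understands -W.

```shell
#!/bin/sh
# Sketch only: kill_hosts, still_alive and report_unretired are illustrative
# names, not part of the original one liner.

# Print the hostname (column 6) of every line whose first column is exactly KILL.
kill_hosts() {
    awk '$1 == "KILL" { print $6 }' "$1"
}

# Succeed if the name still resolves and its first address answers one ping.
still_alive() {
    addr=$(host "$1" | awk '/has address/ { print $4; exit }')
    [ -n "$addr" ] && ping -c1 -W1 "$addr" >/dev/null 2>&1
}

# Print every KILL host from the file that is still reachable.
report_unretired() {
    for h in $(kill_hosts "$1"); do
        if still_alive "$h"; then
            echo "$h"
        fi
    done
}
```

Running "report_unretired /tmp/list.txt" should print only the hosts that were supposed to be retired but still answer. Matching on the first column instead of /KILL/ anywhere in the line avoids picking up a hostname that happens to contain KILL.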
Update 20110819: Per ian@ianbmacdonald.com's comment: "The parameters need to be bond_xmit_hash_policy and bond_lacp_rate. ... You can see in the /proc/net/bonding/bond0 output that the policy is set to "layer2" not "layer 2+3" as per the configuration (because of this error)." I have updated the /etc/network/interfaces entry below to reflect this.
The Solution:
Set up a new Debian Squeeze OpenVZ server with bonding (802.3ad) and VLAN trunking (802.1Q). This article covers the process of getting VLANs and bonding working on Debian Squeeze with a Cisco switch running IOS.
Cisco Setup:
Cisco Hardware
We are using a Cisco 6509 switch with a gigabit ethernet module that supports 802.3ad. For more information regarding the different bonding options you can check out this link. I have not tried getting this to work with a switch that is not 802.3ad (Dynamic link aggregation) capable.
Setup the port channel
interface Port-channel30
 description ServerName
 switchport
 switchport trunk encapsulation dot1q
 switchport trunk allowed vlan 48,49
 switchport mode trunk
 no ip address
end
Configure the physical interfaces on the cisco switch:
interface GigabitEthernet9/5
 description ServerName#1
 switchport
 switchport trunk encapsulation dot1q
 switchport trunk allowed vlan 48,49
 switchport mode trunk
 no ip address
 stack-mib portname ServerName#1
 no snmp trap link-status
 no cdp enable
 channel-protocol lacp
 channel-group 30 mode active
end
interface GigabitEthernet9/19
 description ServerName#2
 switchport
 switchport trunk encapsulation dot1q
 switchport trunk allowed vlan 48,49
 switchport mode trunk
 no ip address
 stack-mib portname ServerName#2
 no snmp trap link-status
 no cdp enable
 channel-protocol lacp
 channel-group 30 mode active
end
...
Make sure the "switchport trunk allowed vlan" line lists the vlans you are going to be using on the linux server. Until these matched it would not work for me.
Linux Network Config:
Install the required packages and load the bonding module
apt-get install vlan ifenslave
modprobe bonding
Interfaces Config: /etc/network/interfaces
auto bond0
iface bond0 inet manual
bond-mode 4
bond-miimon 100
bond_xmit_hash_policy layer2+3
bond_lacp_rate slow
slaves eth0 eth1 eth2 eth3
auto vlan48
iface vlan48 inet static
vlan_raw_device bond0
address 10.169.48.77
netmask 255.255.255.0
network 10.169.48.0
broadcast 10.169.48.255
gateway 10.169.48.1
auto vlan49
iface vlan49 inet static
vlan_raw_device bond0
address 10.169.49.45
netmask 255.255.255.0
network 10.169.49.0
broadcast 10.169.49.255
gateway 10.169.49.1
If you happen to be using OpenVZ, I set the below in /etc/sysctl.conf. I have removed all of the comments and blank lines. You do not need this if you are not using OpenVZ.
egrep -v '^#|^$' /etc/sysctl.conf
net.ipv4.icmp_echo_ignore_broadcasts=1
net.ipv4.conf.eth0.proxy_arp=1
net.ipv4.conf.bond0.proxy_arp=1
net.ipv4.conf.default.forwarding=1
net.ipv4.conf.default.proxy_arp = 0
net.ipv4.ip_forward=1
net.ipv4.conf.all.rp_filter = 0
kernel.sysrq = 1
net.ipv4.conf.default.send_redirects = 1
net.ipv4.conf.all.send_redirects = 0
fs.file-max = 100000
sysctl.conf is only read on bootup, so you need to run the below command to load the file now.
/sbin/sysctl -p
Trouble Shooting:
On Linux
ServerName# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.5.0 (November 4, 2008)
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2 (0)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0
802.3ad info
LACP rate: slow
Aggregator selection policy (ad_select): stable
Active Aggregator Info:
Aggregator ID: 7
Number of ports: 4
Actor Key: 17
Partner Key: 30
Partner Mac Address: 00:15:2c:79:c4:c0
Slave Interface: eth0
MII Status: up
Link Failure Count: 0
Permanent HW addr: d4:85:64:54:1d:5c
Aggregator ID: 7
Slave Interface: eth1
MII Status: up
Link Failure Count: 0
Permanent HW addr: d4:85:64:54:1d:5e
Aggregator ID: 7
Slave Interface: eth2
MII Status: up
Link Failure Count: 1
Permanent HW addr: d4:85:64:54:1d:84
Aggregator ID: 7
Slave Interface: eth3
MII Status: up
Link Failure Count: 1
Permanent HW addr: d4:85:64:54:1d:86
Aggregator ID: 7
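The output above still shows "layer2 (0)" as the transmit hash policy, which is how the misspelled option from the update was spotted. A quick way to check this after a config change might be something like the sketch below. The helper names bond_hash_policy and check_bond_policy are made up, and it assumes the "Transmit Hash Policy:" line format shown above.

```shell
#!/bin/sh
# Sketch only: bond_hash_policy and check_bond_policy are illustrative helpers.

# Print the policy name from the "Transmit Hash Policy:" line of a bonding status file.
bond_hash_policy() {
    awk -F': ' '/Transmit Hash Policy:/ { split($2, p, " "); print p[1]; exit }' "$1"
}

# Compare the active policy ($1 = status file) with the expected one ($2).
check_bond_policy() {
    actual=$(bond_hash_policy "$1")
    if [ "$actual" = "$2" ]; then
        echo "OK: $actual"
    else
        echo "MISMATCH: expected $2, got $actual"
        return 1
    fi
}
```

For example "check_bond_policy /proc/net/bonding/bond0 layer2+3" should report a mismatch for as long as the policy option is not being picked up.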
ServerName# modinfo bonding
filename: /lib/modules/2.6.32-5-openvz-amd64/kernel/drivers/net/bonding/bonding.ko
author: Thomas Davis, tadavis@lbl.gov and many others
description: Ethernet Channel Bonding Driver, v3.5.0
version: 3.5.0
license: GPL
srcversion: C0EFCD8CB4AC214A8146EC2
depends:
vermagic: 2.6.32-5-openvz-amd64 SMP mod_unload modversions
parm: max_bonds:Max number of bonded devices (int)
parm: num_grat_arp:Number of gratuitous ARP packets to send on failover event (int)
parm: num_unsol_na:Number of unsolicited IPv6 Neighbor Advertisements packets to send on failover event (int)
parm: miimon:Link check interval in milliseconds (int)
parm: updelay:Delay before considering link up, in milliseconds (int)
parm: downdelay:Delay before considering link down, in milliseconds (int)
parm: use_carrier:Use netif_carrier_ok (vs MII ioctls) in miimon; 0 for off, 1 for on (default) (int)
parm: mode:Mode of operation : 0 for balance-rr, 1 for active-backup, 2 for balance-xor, 3 for broadcast, 4 for 802.3ad, 5 for balance-tlb, 6 for balance-alb (charp)
parm: primary:Primary network device to use (charp)
parm: lacp_rate:LACPDU tx rate to request from 802.3ad partner (slow/fast) (charp)
parm: ad_select:803.ad aggregation selection logic: stable (0, default), bandwidth (1), count (2) (charp)
parm: xmit_hash_policy:XOR hashing method: 0 for layer 2 (default), 1 for layer 3+4 (charp)
parm: arp_interval:arp interval in milliseconds (int)
parm: arp_ip_target:arp targets in n.n.n.n form (array of charp)
parm: arp_validate:validate src/dst of ARP probes: none (default), active, backup or all (charp)
parm: fail_over_mac:For active-backup, do not set all slaves to the same MAC. none (default), active or follow (charp)
On Cisco
show interfaces port-channel 30
Port-channel30 is up, line protocol is up (connected)
Hardware is EtherChannel, address is 0013.80c0.fa4c (bia 0013.80c0.fa4c)
Description: Punkinpuss
MTU 1500 bytes, BW 4000000 Kbit, DLY 10 usec,
reliability 255/255, txload 1/255, rxload 1/255
Encapsulation ARPA, loopback not set
Keepalive set (10 sec)
Full-duplex, 1000Mb/s
input flow-control is off, output flow-control is off
Members in this channel: Gi9/5 Gi9/19 Gi11/45 Gi12/45
ARP type: ARPA, ARP Timeout 04:00:00
Last input never, output never, output hang never
Last clearing of "show interface" counters 4w3d
Input queue: 0/2000/7/0 (size/max/drops/flushes); Total output drops: 0
Queueing strategy: fifo
Output queue: 0/40 (size/max)
5 minute input rate 34000 bits/sec, 8 packets/sec
5 minute output rate 120000 bits/sec, 112 packets/sec
13303252 packets input, 1748466512 bytes, 0 no buffer
Received 103127 broadcasts (101124 multicasts)
2 runts, 0 giants, 0 throttles
5 input errors, 0 CRC, 0 frame, 2 overrun, 0 ignored
0 watchdog, 0 multicast, 0 pause input
0 input packets with dribble condition detected
111206034 packets output, 42975015356 bytes, 0 underruns
3 output errors, 0 collisions, 1 interface resets
0 babbles, 0 late collision, 0 deferred
0 lost carrier, 0 no carrier, 0 PAUSE output
0 output buffer failures, 0 output buffers swapped out
Links:
openvz on debian
Ubuntu bug report where I found my answer
bonding on debian
bonding on debian in a vmware instance
Conclusion:
I had a hard time finding all of the information required to setup vlan and bonding under squeeze so I put this howto together. Please feel free to post any questions or comments.
Setup selenium to work with firefox on a headless server. I wanted selenium to run as non root and start via init.d.
The Solution:
Install xvfb and firefox via apt and download the selenium jar to /usr/local/selenium. Then setup init scripts for xvfb and selenium.
xvfb
apt-get install xvfb
Setup the init script /etc/init.d/local-xvfb
#!/bin/bash
### CONFIG ###
XPORT=13
USER=selenium
### CONFIG ###
# Print usage and fail when no action is given
if [ -z "$1" ]; then
    echo "`basename $0` {start|stop}"
    exit 1
fi
case "$1" in
    start)
        # Run Xvfb in the background as the unprivileged user
        su $USER -c "/usr/bin/Xvfb :$XPORT &"
        ;;
    stop)
        su $USER -c "killall Xvfb"
        ;;
esac
Test that xvfb init.d is working
/etc/init.d/local-xvfb start
ps aux | grep -i xvfb
/etc/init.d/local-xvfb stop
ps aux | grep -i xvfb
/etc/init.d/local-xvfb start
Setup local-xvfb to start on reboot.
update-rc.d local-xvfb defaults 10
firefox(iceweasel)
Install iceweasel.
apt-get install iceweasel
That was easy... next.
Selenium
Create the selenium user and download the jar for selenium.
addgroup selenium
useradd -g selenium -G selenium selenium
mkdir -p /usr/local/selenium/tmp /var/log/selenium
cd /usr/local/selenium
wget http://selenium.googlecode.com/files/selenium-server-standalone-2.0b3.jar
chown -R selenium:selenium /usr/local/selenium /var/log/selenium
The tmp and log directories are created here because the init script below writes its pid file and logs there. Set up the init.d script to start the selenium service on reboot. In order to get this to work as a non-root user the stop section of the script is not great. Please let me know if anyone has a better way and I'll update the script.
Setup the init script /etc/init.d/local-selenium
#!/bin/bash
### CONFIG ###
# Based on http://robfan.com/post/122618829/continuous-integration-selenium-firefox-flash
SELENIUM_HOME=/usr/local/selenium
LOG_DIR=/var/log/selenium
ERROR_LOG=$LOG_DIR/selenium_error.log
STD_LOG=$LOG_DIR/selenium_std.log
TMP_DIR=$SELENIUM_HOME/tmp
PID_FILE=$TMP_DIR/selenium.pid
JAVA=/usr/bin/java
SELENIUM_APP="$SELENIUM_HOME/selenium-server-standalone-2.0b3.jar"
USER=selenium
### END CONFIG ###
case "${1:-''}" in
    'start')
        if test -f $PID_FILE
        then
            PID=`cat $PID_FILE`
            if ps --pid $PID >/dev/null ; then
                echo "Selenium is running...$PID"
                exit 0
            else
                echo "Selenium isn't running..."
                echo "Removing stale pid file: $PID_FILE"
            fi
        fi
        echo "Starting Selenium..."
        #echo "COMMAND: su $USER -c \"$JAVA -jar $SELENIUM_APP >$STD_LOG 2>$ERROR_LOG &\""
        su $USER -c "$JAVA -jar $SELENIUM_APP >$STD_LOG 2>$ERROR_LOG &"
        error=$?
        if test $error -gt 0
        then
            echo "Error $error! Couldn't start Selenium!"
        fi
        # Record the pid of the java process running the selenium jar
        ps -C java -o pid,cmd | grep $SELENIUM_APP | awk {'print $1 '} > $PID_FILE
        ;;
    'stop')
        if test -f $PID_FILE
        then
            echo "Stopping Selenium..."
            PID=`cat $PID_FILE`
            su $USER -c "kill -3 $PID"
            if kill -9 $PID ; then
                sleep 2
                test -f $PID_FILE && rm -f $PID_FILE
            else
                echo "Selenium could not be stopped..."
            fi
        else
            echo "Selenium is not running."
        fi
        ;;
    'restart')
        if test -f $PID_FILE
        then
            su $USER -c "kill -HUP `cat $PID_FILE`"
            test -f $PID_FILE && rm -f $PID_FILE
            sleep 1
            su $USER -c "$JAVA -jar $SELENIUM_APP >$STD_LOG 2>$ERROR_LOG &"
            error=$?
            if test $error -gt 0
            then
                echo "Error $error! Couldn't start Selenium!"
            fi
            ps -C java -o pid,cmd | grep $SELENIUM_APP | awk {'print $1 '} > $PID_FILE
            echo "Reload Selenium..."
        else
            echo "Selenium isn't running..."
        fi
        ;;
    'status')
        if test -f $PID_FILE
        then
            PID=`cat $PID_FILE`
            if ps --pid $PID >/dev/null ; then
                echo "Selenium is running...$PID"
            else
                echo "Selenium isn't running..."
            fi
        else
            echo "Selenium isn't running..."
        fi
        ;;
    *)
        # no parameter specified
        echo "Usage: $0 start|stop|restart|status"
        exit 1
        ;;
esac
Test that selenium init script is working.
/etc/init.d/local-selenium start
/etc/init.d/local-selenium status
/etc/init.d/local-selenium restart
Setup local-selenium to start on reboot.
update-rc.d local-selenium defaults 95 5
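On the not-so-great stop section: one possible alternative to the kill -3 followed by kill -9 is to ask politely first and only force the issue after a timeout. The helper below is a sketch, not a tested replacement. The name stop_pid is made up, and it assumes the JVM shuts down on TERM and that the script runs as a user allowed to signal the process (root, in an init script).

```shell
#!/bin/sh
# Sketch only: stop_pid is an illustrative helper, not part of the original init script.

# Send TERM to pid $1, wait up to $2 seconds (default 10) for it to exit,
# then fall back to KILL. Returns 0 once the process is gone.
stop_pid() {
    pid=$1
    tries=${2:-10}
    # Nothing to do if the signal cannot be delivered (process already gone).
    kill -TERM "$pid" 2>/dev/null || return 0
    while [ "$tries" -gt 0 ]; do
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        tries=$((tries - 1))
    done
    # Still running after the grace period, so force it.
    kill -KILL "$pid" 2>/dev/null
    sleep 1
    ! kill -0 "$pid" 2>/dev/null
}
```

In the stop section this could be called as stop_pid "$PID" 10 and the pid file removed when it returns 0.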
Conclusion
I hope this helps anyone looking to setup selenium to start as a service and not have it running as root.
Integrating Debian (lenny) into an Active Directory (2008) forest with multiple trusted domains. We wanted to leverage AD for account management and authentication, including groups. One of the goals was to avoid modifying the accounts in AD. We did not want to enter unix attributes in AD for GID or UID.
The Solution:
Utilizing Samba's winbind, kerberos (krb5), nsswitch and pamd to leverage AD. Deployed and managed via puppet.
winbind
First this does not require a full installation of samba. We are going to only use the winbind portion of samba to make this work. Also I am using the backports version of winbind.
Information on using backports can be found here
Install the required packages for winbind:
apt-get install -t lenny-backports winbind samba-common-bin
Now we need to configure winbind. The file we will modify is /etc/samba/smb.conf. Below will work if you are just using winbind. There are other sections required if you will be using other features of samba.
/etc/samba/smb.conf
[global]
workgroup = WORKGROUP1
password server = ad1.domain1.com
realm = DOMAIN1.COM
security = ads
template shell = /bin/bash
winbind offline logon = false
winbind separator = +
kerberos method = secrets and keytab
client ntlmv2 auth = yes
winbind use default domain = yes
winbind enum users = yes
winbind enum groups = yes
winbind nss info = rfc2307
idmap config DOMAIN1:backend = rid
idmap config DOMAIN1:base_rid = 0
idmap config DOMAIN1:range = 100000 - 199999
idmap config DOMAIN2:backend = rid
idmap config DOMAIN2:base_rid = 0
idmap config DOMAIN2:range = 200000 - 299999
# Map any users/groups that are not in the trusted domains to this:
idmap backend = tdb
idmap uid = 900000-950000
idmap gid = 900000-950000
# this is set by default (run testparm to see it)
passdb backend = tdbsam
# Refresh kerberos tickets
winbind refresh tickets = yes
The reason the separator is changed in the above configuration is to allow many of the unix tools to work with the domain accounts. The default separator is "\", which does not work with tools such as ssh. If you only have one domain this is not strictly necessary.
kerberos
First install the required packages for kerberos to work. Note samba-common-bin required for "net" command, installed above.
apt-get install krb5-clients krb5-user ntp
Now we need to configure kerberos. This configuration is for an AD forest with multiple domains in a trust. If you only have one domain you can remove the parts for multiple domains.
/etc/krb5.conf
[logging]
default = FILE:/var/log/krb5libs.log
kdc = FILE:/var/log/krb5kdc.log
admin_server = FILE:/var/log/kadmind.log
[libdefaults]
default_realm = DOMAIN1.COM
dns_lookup_realm = false
dns_lookup_kdc = false
ticket_lifetime = 24h
forwardable = yes
default_keytab_name = FILE:/etc/krb5.keytab
[realms]
DOMAIN1.COM = {
kdc = ad1.domain1.com:88
kdc = ad2.domain1.com:88
admin_server = ad1.domain1.com:749
master_kdc = ad1.domain1.com
}
DOMAIN2.COM = {
kdc = ad1.domain2.com:88
}
[domain_realm]
.domain1.com = DOMAIN1.COM
domain1.com = DOMAIN1.COM
.domain2.com = DOMAIN2.COM
domain2.com = DOMAIN2.COM
[appdefaults]
pam = {
debug = false
ticket_lifetime = 36000
renew_lifetime = 36000
forwardable = true
krb4_convert = false
}
Join your server to the domain:
net ads join member -U {administrator}
Test the join:
net ads testjoin
NTP:
ntp is included in the install of kerberos because kerberos depends on the time of the servers being correct. It should be pointed to the same ntp server as your AD servers. I'll be happy to be more verbose in this section if anyone has any questions.
nsswitch
nsswitch.conf is the System Databases and Name Service Switch configuration file, that is part of the base-files package in Debian (more information ).
The file /etc/nsswitch.conf needs to be changed to use winbind for passwd, group, and shadow:
/etc/nsswitch.conf (snippet)
passwd: files winbind
group: files winbind
shadow: files winbind
pamd
pam is the Pluggable Authentication Modules for Linux (more information ).
There are several files we need to change in order to get authentication working with pam: common-password, common-session, common-account and common-auth. Need more details about each section and why they are changed to ...
common-password
/etc/pam.d/common-password
password sufficient pam_unix.so nullok obscure md5
password sufficient pam_winbind.so use_first_pass
password required pam_deny.so
common-session
/etc/pam.d/common-session
session required pam_unix.so
session required pam_mkhomedir.so umask=0022 skel=/etc/skel
common-account
/etc/pam.d/common-account
account sufficient pam_winbind.so
account sufficient pam_unix.so
account required pam_deny.so
common-auth
Need to explain this section, including why I am using sid as opposed to name or gid. Also how does one get the sid using getent.
/etc/pam.d/common-auth
auth sufficient pam_winbind.so require_membership_of=S-x-x-xx-xxxxxxxx-xxxxxxxxxx-xxxxxxxxxx-2777
auth sufficient pam_winbind.so require_membership_of=S-x-x-xx-xxxxxxxx-xxxxxxxxxx-xxxxxxxxxx-1190
auth sufficient pam_unix.so nullok_secure use_first_pass
auth required pam_deny.so
We are using the SID in common-auth because it is a unique identifier, as opposed to the rid or group name which are not guaranteed to be unique.
Overview
In order to test you can use getent (man). Using getent you should now be able to find a user on the primary domain with "getent passwd | grep {user}". The results should look something like below:
getent passwd | grep fugit
fugit:*:101234:100123:Fugit Fugit:/home/DOMAIN1/fugit:/bin/sh
You should be able to run the command "getent group | grep DOMAIN2" and see the AD groups for domain2. You can do the same for users with the command "getent passwd | grep {user}"
getent passwd | grep tempus
DOMAIN2+tempus:*:202132:200123:tempus:/home/DOMAIN2/tempus:/bin/bash
In the above output please notice the '+' after the domain. This is needed in order to allow common unix tools such as ssh to work. If you are seeing all of your users but ssh isn't working, please ensure you are using '+' as the domain separator instead of the default '\'. Also some trouble shooting.
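One more sanity check that might be useful: with the rid backend each domain gets its own uid range in smb.conf, so a user's uid should land in the range for its domain (100000 - 199999 for DOMAIN1, 200000 - 299999 for DOMAIN2 in the config above). Below is a small sketch for checking that. The helper name uid_in_range is made up for illustration.

```shell
#!/bin/sh
# Sketch only: uid_in_range is an illustrative helper name.

# Read one passwd-format line on stdin; succeed if its uid (field 3)
# lies within the inclusive range $1..$2.
uid_in_range() {
    awk -F: -v lo="$1" -v hi="$2" '
        { ok = ($3 + 0 >= lo + 0 && $3 + 0 <= hi + 0) }
        END { exit (ok ? 0 : 1) }'
}
```

For example: getent passwd tempus | uid_in_range 200000 299999 && echo "tempus is mapped in the DOMAIN2 range". A uid in the 900000-950000 fallback range would suggest the per-domain idmap config is not being applied.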
puppet
I am currently doing all of this via puppet except the "net ads join". I am hoping to be able to provide more details regarding handling this with puppet in the future.
Conclusion
I hope this was helpful to others trying to join linux servers to an Active Directory (2008) forest with multiple trusted domains.
References
battista article
Samba Guide
A user requested that we do post-commit rsyncs only when a trigger file is updated. This is a quick blogpost about doing the trigger files.
They did not require any additional security they just wanted to be able to deploy via a trigger file.
Below is the script I used to allow for trigger files:
#!/bin/sh
#CONFIG
DEBUG=""
REPOS="$1" # RESET AFTER GETOPTS
REV="$2" # RESET AFTER GETOPTS
SVNLOOK=/usr/local/bin/svnlook
TRIGGER_PAIRS=":" #spaces between each pair
while getopts "d" optionName; do
case "$optionName" in
d) DEBUG="1";;
[?]) echo "Usage: $0 [-d] "
esac
done
shift $(($OPTIND - 1))
# SET AFTER GETOPTS
REPOS="$1"
REV="$2"
#MAIN
# CHECK FOR TRIGGERED SYNCS
for trigger_pair in $TRIGGER_PAIRS
do
TRIGGER_FILE=`echo $trigger_pair |awk -F: {' print $1 '}`
COMMIT_FILE=`echo $trigger_pair |awk -F: {' print $2 '}`
# DEBUG
if [ -n "$DEBUG" ]
then
echo "TRIGGER_FILE: $TRIGGER_FILE" # DEBUG
echo "TRIGGER_PAIR $trigger_pair" # DEBUG
echo "COMMIT_FILE: $COMMIT_FILE" # DEBUG
fi
# Run the hook only if the trigger file is among the paths changed in this revision
if [ -n "$( $SVNLOOK changed -r "$REV" "$REPOS" | egrep "$TRIGGER_FILE" )" ]
then
# DEBUG
if [ -n "$DEBUG" ]
then
echo TRIGGER DEBUG: $REPOS/hooks/${COMMIT_FILE} "$REPOS" "$REV" # DEBUG
fi
$REPOS/hooks/${COMMIT_FILE} "$REPOS" "$REV"
fi
done
# END CHECK FOR TRIGGERED SYNCS #
# NON TRIGGERED SYNCS
# CALL YAML:SVN::NOTIFY SYNC for
$REPOS/hooks/ "$REPOS" "$REV"
# END NON TRIGGERED SYNCS
You don't need all of the debug info but I found it helpful. There is a horrible hack that I was too lazy to fix for re-setting the repo and rev after getopts.
The script can take multiple pairs of trigger files and the script to run (which does the rsync) if the trigger file was updated. All of the work is done under main. First the variables are read in and then it checks to see if the trigger file was updated using svnlook.
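The pair matching can also be pulled out into a function that is easy to test without a repository. This is only a sketch: the name triggered_scripts and the example pair values are made up, it compares the changed path exactly instead of using egrep as the script above does, and it assumes paths without whitespace.

```shell
#!/bin/sh
# Sketch only: triggered_scripts is an illustrative helper; the pair values
# passed to it are hypothetical.

# Read "svnlook changed" output on stdin and print the hook script of every
# "trigger_path:script" pair (given as arguments) whose trigger path changed.
triggered_scripts() {
    changed=$(cat)
    for pair in "$@"; do
        trigger=${pair%%:*}
        script=${pair#*:}
        # svnlook prints status flags, whitespace, then the path (last field).
        if printf '%s\n' "$changed" |
            awk -v t="$trigger" '$NF == t { found = 1 } END { exit (found ? 0 : 1) }'
        then
            echo "$script"
        fi
    done
}
```

Usage would be along the lines of: $SVNLOOK changed -r "$REV" "$REPOS" | triggered_scripts $TRIGGER_PAIRS, then running each script name that comes back from $REPOS/hooks/.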
At the end of the script I call other svn::notify::config setups that get synced without triggers.
Please post any comments if you would like more info on the trigger script or using svn::notify.
We were having an issue where some TLS connections were failing with "SSL_accept error from". There were a couple of domains affected, but Microsoft was one of the larger legitimate ones we were having a problem with.
quick answer -> The Answer
Log Entry:
SSL_accept error from smtp.microsoft.com[131.107.115.212]: -1
The problem only started occurring after an upgrade from debian etch to debian Lenny. One server had not been upgraded yet and could successfully handle all mails for which the upgraded servers were getting the "SSL_accept error from". This meant something had changed during the upgrade process that was causing this error. To give you an idea of the scope of the issue, we get about 7500 TLS e-mails per day and 7000 were working fine. Only about 500 were failing on the upgraded mail servers and then working on the older etch server.
Setup Details
Here are more details on the different systems. The new servers were running debian Lenny with postfix 2.5.5-1.1 and openssl 0.9.8g-15+Lenny6. The system that hadn't been upgraded, which matches what all the others were running before the upgrade, is running debian etch with postfix 2.3.8-2+etch1 and openssl 0.9.8c-4etch9.
TROUBLE SHOOTING
The first thing I did was to check that my certs were good even though 7000 messages a day were working fine I wanted to double check. Using openssl and the directions at:
http://www.cyberciti.biz/faq/test-ssl-certificates-diagnosis-ssl-certificate/ I confirmed the certs and the fact that ssl was working.
One thing to note while testing TLS from openssl with the following command:
openssl s_client -starttls smtp -crlf -connect mail2.xxx.com:25 -CApath ~/.cert/mail2.xxx.com -cipher RC4-MD5
I found out about a bug running the s_client within openssl. It is not a perfect client and has some limitations. I have always used all caps when troubleshooting an SMTP conversation. This turned out to be a problem with openssl s_client, which treats an input line starting with a capital R as a request to renegotiate. Specifically, when doing RCPT TO: it kept RENEGOTIATING. You must use "rcpt to:" and NOT "RCPT TO:"; the first note of this I could find was at http://archives.neohapsis.com/archives/postfix/2007-01/1334.html. Well, it is still a problem.
Why did I force the RC4-MD5 cipher? I had noticed most of the failures were using this cipher, just turned out to be a coincidence.
Ok back to what else I did to try and troubleshoot this issue. I made the mistake of trying to turn up TLS logging past level 2, which wasn't showing me anything. I had done this on a spamtrap server and the logs directory filled in under 2 hours. So this was not a good idea. I forced a log rotate and looked to peer logging. I turned on peer logging for several of the domains that were having the issue. The main.cf entries for postfix are below:
postfix/main.cf (snippet)
debug_peer_list = microsoft.com, XXX.com, XXX.com
debug_peer_level = 3
After I had turned logging way up for just those domains I did not really see anything on the servers that were having the problems. It still looked like they were just connecting and then dropping off.
At this point I decided to try a brute force approach and upgraded the TLS and postfix packages to squeeze on one of the spamtrap servers. However in order to test it I moved around the MX weights. When I came in the next day it took about 2 minutes to notice it was still having a problem. On the central log server I was monitoring the mail log for anything with TLS or SSL in the line. This gave a good picture of the problem. I quickly saw someone connect to the spamtrap(now with a normal MX record) server get dropped and then switch to the secondary server running etch and have the transaction run with no problems.
tail -f /var/log/mail.log | egrep 'TLS|SSL'
primary1 postfix/smtpd[13528]: setting up TLS connection from mailserver.xxx.com[xx.217.202.16]
primary1 postfix/smtpd[13528]: SSL_accept error from mailserver.xxx.com[xx.217.202.16]: -1
primary1 postfix/smtpd[13528]: lost connection after STARTTLS from mailserver.xxx.com[xx.217.202.16]
primary2 postfix/smtpd[10291]: setting up TLS connection from mailserver.xxx.com[xx.217.202.16]
primary2 postfix/smtpd[10291]: SSL_accept error from mailserver.xxx.com[xx.217.202.16]: -1
primary2 postfix/smtpd[10291]: lost connection after STARTTLS from mailserver.xxx.com[xx.217.202.16]
secondary setting up TLS connection from mailserver.xxx.com[xx.217.202.16]
secondary postfix/smtpd[14665]: TLS connection established from mailserver.xxx.com[xx.217.202.16]: TLSv1 with cipher RC4-MD5 (128/128 bits)
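To get a feel for which senders were affected, the same log can be boiled down to a count of failures per client. The snippet below is a rough sketch rather than what I ran at the time. The function name ssl_accept_errors is made up, and it assumes log lines in the format shown above.

```shell
#!/bin/sh
# Sketch only: ssl_accept_errors is an illustrative helper name.

# Read mail log lines on stdin and print "count client" for every client
# with at least one "SSL_accept error from" line, highest count first.
ssl_accept_errors() {
    awk '/SSL_accept error from/ {
             for (i = 1; i < NF; i++)
                 if ($i == "from") {
                     client = $(i + 1)
                     sub(/:$/, "", client)   # drop the trailing colon
                     n[client]++
                     break
                 }
         }
         END { for (c in n) print n[c], c }' | sort -rn
}
```

For example: ssl_accept_errors < /var/log/mail.log | head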
Ok back to the drawing board. After some more searching http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=573748 didn't exactly match but I figured I would try compiling from source and include the extra library call. I was running out of ideas and I was getting desperate. That failed as quickly as the idea of upgrading. At this point I looked at how hard it would be to downgrade postfix and openssl. Well that didn't look fun and didn't provide the answer which was almost as important as fixing the issue.
During this time with the increased logging I was also using ssldump to try and get more information regarding the problem. Both on the working server and the servers that were having a problem.
ssldump -i eth0 -AadnkxX -k /etc/postfix/tls/key.pem
ssldump -i eth0 -AadnkxX -k /etc/postfix/tls/key.pem > /tmp/ssldump.txt
Again the connections dropped off without any additional information in the logs on the primary servers running lenny and the secondary server reported nothing special.
I didn't include that I had gone to IRC #postfix and #openssl with little luck. One person had suggested it could be a library mismatch. When I first got this answer there was only one domain having the problem, even if it was a very large 3rd party e-mail handler. However, as I continued to monitor the problem and noticed more legitimate e-mail servers having the same problem, that answer was no longer sufficient. I keep using microsoft as an example as they are very large, legitimate and posting about them having a problem won't hurt anyone's feelings.
After revisiting IRC on freenode I failed to get any additional information regarding my problem. After not getting anywhere on freenode I thought I would ask my local friends on irc-debian.org. I asked "anyone want to help trouble shoot postfix openssl tls issue? Its starting to get to me :)"
SOLUTION
Lucky for me dkg volunteered to help look into the situation and we discussed some information back and forth. dkg had another idea "weight 20(secondary etch server) has a smaller handshake than the other ones." The secondary server running etch has a 17KB handshake and the others have a 21KB handshake. I go ahead and do a dpkg-reconfigure ca-certificates to reduce the size of the handshake. I figure the etch server is working so I remove any of the ca roots that are not on the etch server.
Lo and behold, I tell dkg I owe him a beer as this fixed the problem.
Some interesting notes
The root authorities removed did not include Microsoft's root authority, and the TLS connections with them were Trusted once the issue was resolved. Another interesting problem that also got resolved: when I started writing the ssldumps to files I noticed that I was getting handshake errors about the length of the handshake being too short. After getting the root certificates file down below 20k (to ~17k) that error went away as well.
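If you want to see whether your own servers are in the same territory, counting the certificates and bytes in the CA file is a quick first look. This is a sketch with a made-up function name (bundle_stats). Which file to check depends on your setup; asking postfix for its smtpd_tls_CAfile setting, as in the usage line below, is only one guess.

```shell
#!/bin/sh
# Sketch only: bundle_stats is an illustrative helper name.

# Print "<number of certificates> <size in bytes>" for the PEM bundle $1.
bundle_stats() {
    certs=$(grep -c 'BEGIN CERTIFICATE' "$1")
    bytes=$(wc -c < "$1")
    # $bytes is left unquoted on purpose to drop any padding wc may add.
    echo "$certs" $bytes
}
```

For example: bundle_stats "$(postconf -h smtpd_tls_CAfile)". Comparing the numbers between a working and a failing server shows how far the two CA lists have drifted apart.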
I wrote all of this in the hopes that I could save someone else 2 weeks of losing their wits and having to find a very obscure solution to a difficult problem.
Please feel free to leave any comments, suggestions, questions or any corrections :)
Just had way too much food again the day after Thanksgiving.
Going to check out some other people's blogs. Hopefully I'll get motivated and put up stuff people might actually be interested in reading, after this week.
Fugit... going to enter a food coma. . .