Server Admin Log/Archive 13 - Wikitech
Server Admin Log
June 30
23:58 Tim: killed gearmanWorker.php instances on hume running with the old master conf, re-ran them running as apache
23:20 Andrew: 23:19 <@brion> !log s1 switched to db16, enwiki back online! (thanks tim!) tomasz rebooted wikitech linode which also borked. :P
22:58 logmsgbot: brion synchronized php-1.5/CommonSettings.php 'put non-enwiki back to read/write'
22:57 brion: db14 (s1 master) is in some kind of borked state. slaves seem up to date; gonna try a master switch.
22:49 logmsgbot: brion synchronized php-1.5/CommonSettings.php
22:49 brion: putting site to read-only
22:44 Andrew: Rolling back updates, first vector skin, then config changes. If that doesn't help, will roll back UsabilityInitiative changes
22:43 logmsgbot: andrew synchronized php-1.5/skins/Vector.php 'Rolling back updates to r52581'
22:29 Andrew: db14 seems to be overloaded
22:27 Andrew: reports of (Cannot contact the database server: Unknown error (10.0.6.24))
22:24 logmsgbot: andrew synchronized php-1.5/InitialiseSettings.php 'Remove vector from wgSkipSkins on usabilitywiki'
22:21 Andrew: Updated Vector and UsabilityInitiative, deployed UsabilityInitiative to usabilitywiki. Scapping to apply.
21:06 tomaszf: restarting lighttpd on storage2 due to large i/o wait
20:01 logmsgbot: andrew synchronized php-1.5/extensions/WikimediaMessages/WikimediaMessages.i18n.php
18:44 tomaszf: dropping -grosley dns entries in favor of dev. and civicrm.
16:07 Fred: restarted mwserve on pdf1 as it was not processing pdfs anymore.
15:56 brion: activity on pdf1 seems to have croaked around 15:30. pdf daemon might need a restart?
04:28 logmsgbot: tstarling synchronized php-1.5/extensions/timeline/Timeline.php
04:27 logmsgbot: tstarling synchronized php-1.5/extensions/timeline/EasyTimeline.pl
04:27 Tim: updated EasyTimeline to r52591
03:34 Rob: seeing good results with the firewall plugin on the corporate blog, copying it over to the techblog and setting it to active.
02:57 Rob: It was net gremlins!
02:56 brion: appears to have been temporary networking issue (external?) affecting access to some clients
02:48 brion: some sort of downtime reported for a few minutes. no apparent current problems persisting, all looks well
02:11 hcatlin: Mobile site is having some encoding issues with UTF-8 characters, therefore the redirects have been disabled from common.js
June 29
23:03 logmsgbot: andrew synchronized php-1.5/extensions/WikimediaMessages/WikimediaMessages.i18n.php 'Updated l10n'
23:00 Fred: updated DNS for *.m.wikipedia.org to point to mobile1 instead of eiximenis
22:05 logmsgbot: andrew synchronized php-1.5/extensions/UsabilityInitiative/EditToolbar/EditToolbar.js 'Trevor told me to!'
22:05 logmsgbot: andrew synchronized php-1.5/extensions/UsabilityInitiative/EditToolbar/EditToolbar.css 'Trevor told me to!'
21:47 Fred: apache stuck on srv[210-214] - restarting.
21:36 Andrew: scapping for vector updates
21:35 Andrew: updated Vector skin to r52581
21:18 Andrew: Lost connection to zwinger mid-scap. Running scap again to make sure everything's fine.
21:10 Andrew: Scapping to deploy UsabilityInitiative CSS/JS changes, and license updates done by Fred.
21:09 Andrew: updated WikimediaMessages to deploy the latest changes pushed to svn by siebrand.
20:58 Andrew: not scapping yet, deploying license change stuff too, waiting for Fred to do the rest of the updates for that
20:52 Andrew: scapping
20:49 Andrew: Activating UsabilityInitiative extension and removing vector from $wgSkipSkins on testwiki
20:38 RobH_A90: racked and installed mobile1
20:38 RobH_A90: pulled nehlam test server
19:17 Andrew: Fred svn up'd EditPage.php, Skin.php, MessagesEn.php and WikimediaMessages. I reversed the updates in EditPage.php and Skin.php and used svn merge -c to update EditPage.php and svn up -r to update Skin.php, cherry-picking only the correct updates (r52361)
19:01 Andrew: updated UsabilityInitiative extension on test as prep for deployment on testwiki.
June 28
19:39 RobH_away: ran authdns-update
19:39 RobH_away: changing bayes mgmt IP (to make it sane and in line with the rest of the mgmt IP ranges) and adding the new IP to dns and reverse template files
18:58 brion: bayes back up
18:52 logmsgbot: midom synchronized php-1.5/db.php 'db28 live'
18:49 brion: poking LOM on bayes; box is down, EZ requested reboot.
08:54 domas: manually reset db28 position, based on innodb internal info, got some filesystem relaylog corruption
08:52 logmsgbot: midom synchronized php-1.5/includes/GlobalFunctions.php 'removing messages profiling hook'
06:43 Tim: rebooted db28 with /proc/sysrq-trigger
06:29 Tim: depooled db28, locked up in kswapd, needs reboot
June 27
07:42 domas: changed log expiration on db9 to 100 days
June 26
20:33 Rob: blog.wikimedia.org back online, very restrictive settings and improved security (i hope)
19:09 Rob: singer https services back online for survey. ocs, wm09schols
18:20 Rob: Singer Restore: survey.wikimedia.org - UP, ocs.wikimania2009.wikimedia.org - UP
17:11 Rob: pulled blog.wikimedia.org out of squid via changing its dns to point directly at singer. just to make it easier to secure and fix on the fly later today
15:03 Rob: firing singer back up for reinstall and restoration from the haxor
14:42 logmsgbot: midom synchronized php-1.5/includes/parser/ParserCache.php 'removing the pcache hack for now'
11:44 domas: halted singer, don't start it without contacting me.
07:27 logmsgbot: brion synchronized php-1.5/includes/parser/ParserCache.php 'Putting live hack for MJ article back. Total traffic spike down but article traffic still spiking.'
05:53 Tim: rebooting srv182, swap death since ~00:10 UTC
05:50 Tim: killed broken mysqld_safe instances on srv157, srv161, srv167, srv173, srv174, srv175, srv181, srv185, srv186
05:17 logmsgbot: jeluf synchronized php-1.5/mc-pmtpa.php 'replace srv113 by srv142'
05:14 logmsgbot: jeluf synchronized php-1.5/mc-pmtpa.php 'replace srv182 by srv92'
June 25
23:36 logmsgbot: brion synchronized php-1.5/includes/parser/ParserCache.php 'live hack to extend caching of Michael Jackson'
23:17 logmsgbot: brion synchronized php-1.5/includes/db/Database.php 'tweak the db err msg'
22:46 logmsgbot: brion synchronized php-1.5/mc-pmtpa.php 'putting 156 mc back'
22:44 logmsgbot: brion synchronized php-1.5/mc-pmtpa.php 'swapping in 145 in place of temp down 156'
22:43 tomaszf: rebooting srv156 due to hard down
22:41 logmsgbot: midom synchronized php-1.5/db.php '156 down'
18:26 RobH_A90: srv92 back up with replaced fans
18:16 RobH_A90: srv92 down due to bad fan, going to replace it with another dead server fan, yay for frankenstein servers
18:09 RobH_A90: db17 has had a cold reset, service processors are now responsive, as well as system
09:21 logmsgbot: tstarling synchronized php-1.5/languages/Language.php
June 24
16:27 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'bug 19364'
15:56 Fred: modified /home/wikipedia/conf/nagios/sync to include new nagios server in the sync/restart process.
15:55 Fred: cleaned up /home/wikipedia/conf/nagios/conf.php to remove unused server / conf parameters.
15:54 Fred: Cleaned up dsh node_group and resynched Nagios.
10:26 logmsgbot: midom synchronized php-1.5/includes/specials/SpecialRecentchanges.php 'livehack upper limit, as mediawiki devs are lazy.. oh wait.'
08:29 domas: compressed databases on db3-5 using lzip, freed up the space. :)
08:16 logmsgbot: midom synchronized php-1.5/db.php 'db24 back to life'
06:28 Tim: updated GeSHi to 1.0.8.4
05:54 Tim: removed all SVN externals from the MW working copy. Updated extensions/SyntaxHighlight_GeSHi to r52346. Scapping.
June 23
22:00 Fred: make && deploy of new squid configuration with added acl for spence.w.o
21:07 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '19364 Set project namespace of Portuguese Wikibooks'
20:49 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Fixing issues with arwiki namespacing and formatting'
19:33 Rob: checked up on the updateTitles running against enwiki on hume. On 11864380 / 23266362
19:31 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '19357 Create new namespace on arwiki'
19:22 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php
19:14 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Bug 19116 adding namespace aliases for itwikibooks'
15:39 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '19328 set for fywikibooks - typosssss'
15:36 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '19328 set for fywikibooks'
15:24 logmsgbot: robh synchronized php-1.5/abusefilter.php 'Updated with Andrew'
15:15 Rob: wrong reason listed in sync, oops, was for bug 19272
15:15 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'extensions/AbuseFilter/abusefilter.tables.sql'
15:14 logmsgbot: robh synchronized php-1.5/abusefilter.php '19274 Enable AbuseFilter on Lithuanian Wikipedia.'
08:23 logmsgbot: midom synchronized php-1.5/db.php 'add db28 to s1'
08:20 logmsgbot: midom synchronized php-1.5/db.php 'db19 going live'
00:09 brion: obsoleted useless "web browser" custom field on bugzilla. Doesn't appear in search, hardcoded list would need to be maintained, generally not useful.
June 22
23:27 logmsgbot: fvassard synchronized php-1.5/InitialiseSettings.php 'Changed wgRC2UDPPrefix for usability.wikimedia'
20:02 tomaszf: adding stats.m.wikipedia.org for hcatlin
20:01 tomaszf: pkill'd pdns on ns1 due to zombie and defunct procs
19:31 JeLuF: removed the refresh_pattern from ortelius' squid config
19:30 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Trying to enable Nuke on enwiki'
19:12 JeLuF: firefishy switched OSM DNS to deliver tiles via ortelius
17:45 JeLuF: added service IP for tiles.wikimedia.org
16:51 logmsgbot: brion synchronized php-1.5/CommonSettings.php 'Setting CC by-sa 3 $wgRightsUrl/$wgRightsText ... see if anything explodes.'
16:15 logmsgbot: andrew synchronized php-1.5/InitialiseSettings.php '19309 Creation of "autopatroller" usergroup on en.wikipedia'
16:00 Fred: restarted apache on srv208, srv132 as well
15:59 Fred: restarted apache on srv182
June 21
21:28 logmsgbot: andrew synchronized php-1.5/lucene.php 'switch enwiki mwsuggest to lucene backend'
20:57 logmsgbot: andrew synchronized php-1.5/lucene.php
11:30 domas: copying db3->db28
11:21 logmsgbot: midom synchronized php-1.5/db.php 'depooling db24, was not in replication, IT IS GODDAMN SNAPSHOT SLAVE'
11:20 logmsgbot: midom synchronized php-1.5/db.php 'repooling db15, db24, depooling db3, db4'
11:08 logmsgbot: midom synchronized php-1.5/db.php 'db5 off, will use as db19 copy source'
June 20
19:06 domas: pdf1 mw-serve had segfaulting python processes, haha! kill + /etc/init.d/mwserve start seems to have helped.
08:18 logmsgbot: midom synchronized php-1.5/db.php 'removing db3,db4,db5, will be rebuilt into national slaves'
June 19
23:54 logmsgbot: ariel synchronized php-1.5/CommonSettings.php 'Test-push - No Change'
23:53 logmsgbot: ariel synchronized php-1.5/CommonSettings.php 'Test-push - No Change'
22:24 logmsgbot: fvassard synchronized php-1.5/InitialiseSettings.php '18902 Enable collectionsaveascommunitypage and collectionsaveasuserpage on Default'
17:13 Fred: exim configuration on Sanger updated. Details can be found on
16:08 RobH_A90: shutting down eiximenis for ram upgrade (thus m.* will be down until it is back online)
03:18 logmsgbot: tstarling synchronized php-1.5/includes/SkinTemplate.php 'fix for fatal error'
June 18
20:16 logmsgbot: andrew synchronized php-1.5/extensions/AbuseFilter/AbuseFilter.parser.php 'Live-merged r52141, which fixes broken page-moves on Wikimedia sites'
20:11 logmsgbot: andrew synchronized php-1.5/extensions/AbuseFilter/AbuseFilter.parser.php 'Less aggressive testing code as I can't immediately reproduce the bug. Still loading extension messages as a live hack so we can see wtf is going on when a user reports the bug'
20:01 logmsgbot: andrew synchronized php-1.5/extensions/AbuseFilter/AbuseFilter.parser.php 'Testing code for unbreaking page moving'
16:46 Fred: starting the upgrade process for all apache boxen.
16:24 Fred: rebooting srv224 to test dist upgrade.
16:16 mark: Ran rm -rf /var/tmp/texvc on all apaches
16:13 mark: Upgraded wikimedia-task-appserver to 1.39 on srv76
15:40 logmsgbot: mark synchronized all.dblist
15:38 logmsgbot: mark synchronized all.dblist
15:26 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'updating logo for ruwikimedia'
15:07 Tim: running schema updates on ruwikimedia
14:57 Rob: correction, did not remove, merely set to false
14:57 Rob: removed some apaches from pybal config since they are not receiving updates
14:46 logmsgbot: robh synchronized php-1.5/flaggedrevs.php
14:05 logmsgbot: tstarling synchronized php-1.5/includes/specials/SpecialUserrights.php 'r52116'
08:00 logmsgbot: tstarling synchronized php-1.5/extensions/FlaggedRevs/FlaggedRevs.class.php
07:45 logmsgbot: tstarling synchronized php-1.5/mc-pmtpa.php 'srv92 down'
07:24 Tim: scap to r52088
05:26 JeLuF: reinstalled ortelius (formerly known as ptolemy)
05:24 Tim: amane had wikimedia-task-appserver 1.33, was reporting it was "kept back". Ran apt-get dist-upgrade to fix.
04:33 Tim: db15 lagged due to schema updates, depooling again
04:33 logmsgbot: tstarling synchronized php-1.5/db.php
03:46 logmsgbot: tstarling synchronized php-1.5/db.php
June 17
21:27 JeLuF: ptolemy squid test setup running, needs some fine tuning (esp. statistics)
19:48 JeLuF: added ptolemy in dhcpd configuration
19:27 logmsgbot: andrew synchronized php-1.5/extensions/AbuseFilter/AbuseFilter.parser.php 'Deployed r52047, fix for AbuseFilter parser fatals'
19:20 logmsgbot: andrew synchronized php-1.5/includes/HTMLForm.php 'Deploying r52070, string/int inconsistency in XmlSelect value/default breaking imagesize option'
16:43 Fred: rebooting srv101 to finish install.
16:35 logmsgbot: andrew synchronized php-1.5/api.php 'Mistake in previous sync'
16:23 logmsgbot: fvassard synchronized php-1.5/mc-pmtpa.php 'srv75 is down. Replacing with spare: srv97'
16:22 logmsgbot: andrew synchronized php-1.5/api.php 'Fix API for secure.wikimedia.org with ugly live-hack'
16:17 Fred: srv101 install from Monday not completed. Finishing it now. (yes this is a memcached node as well)
16:02 logmsgbot: andrew synchronized php-1.5/api.php 'Debugging for bug 19263'
13:54 logmsgbot: tstarling synchronized php-1.5/flaggedrevs.php 'removed bug 19207 workaround'
13:40 logmsgbot: tstarling synchronized php-1.5/extensions/TrustedXFF/trusted-xff.cdb
13:30 Tim: running schema changes on db15
09:46 Tim: scap at r52034
09:32 Tim: updating to r52031
06:04 logmsgbot: tstarling synchronized php-1.5/InitialiseSettings.php 'enabled djvutxt'
June 16
20:56 logmsgbot: andrew synchronized php-1.5/includes/Preferences.php 'Fix bug 19237, broken Preferences page for some languages (e.g. cs)'
18:37 Fred: irc.wikimedia.org is back in service. Channel list is growing. Everything seems to be working as expected.
18:02 Fred: upgrading irc.wikimedia.org. Server will be offline for a couple of minutes.
15:37 logmsgbot: tstarling synchronized php-1.5/flaggedrevs.php 'granted all bots the review right to work around bug 19207'
15:36 logmsgbot: andrew synchronized php-1.5/CommonSettings.php 'Blocking an email address which has been spamming ascii art to admins'
14:54 logmsgbot: andrew synchronized php-1.5/extensions/AbuseFilter/AbuseFilter.parser.php 'Fix for backwards-incompatibility in AbuseFilter list handling'
10:10 logmsgbot: andrew synchronized php-1.5/extensions/CodeReview/CodeRevision.php 'Live-merging r51955 "CodeReview doesn't load messages when sending e-mail notifications of follow-up revs"'
09:44 logmsgbot: andrew synchronized php-1.5/includes/specials/SpecialUserrights.php 'Live-merged r51952, fix for Special:GlobalGroupMembership'
00:42 Tim: scap
00:41 Tim: svn up to r51943
June 15
23:05 logmsgbot: tstarling synchronized php-1.5/flaggedrevs.php 'autoreview bot edits'
20:30 mark: Fixed puppet config of srv35
14:47 logmsgbot: tstarling synchronized php-1.5/includes/ImageFunctions.php 'docs'
14:45 logmsgbot: tstarling synchronized php-1.5/languages/Language.php 'removed hack'
14:41 logmsgbot: tstarling synchronized php-1.5/includes/filerepo/FSRepo.php 'documented hack'
13:58 Tim: updated core to r51904, will scap
13:30 logmsgbot: tstarling synchronized php-1.5/serialized/MessagesMr.ser
13:30 logmsgbot: tstarling synchronized php-1.5/languages/messages/MessagesMr.php
12:59 logmsgbot: tstarling synchronized php-1.5/includes/MagicWord.php
12:54 logmsgbot: tstarling synchronized php-1.5/languages/Language.php
12:35 logmsgbot: tstarling synchronized php-1.5/includes/MagicWord.php
12:05 logmsgbot: tstarling synchronized php-1.5/includes/MagicWord.php 'debugging'
12:03 logmsgbot: tstarling synchronized php-1.5/includes/MagicWord.php 'debugging'
11:54 logmsgbot: tstarling synchronized php-1.5/includes/MagicWord.php 'debugging'
11:44 logmsgbot: tstarling synchronized php-1.5/includes/MagicWord.php 'debugging'
11:32 logmsgbot: tstarling synchronized php-1.5/includes/Skin.php 'r51882'
11:29 Tim: created missing table code_bugs on mediawikiwiki
11:27 logmsgbot: tstarling synchronized php-1.5/includes/ChangeTags.php 'fixed change_tag index name (2nd attempt)'
11:25 logmsgbot: tstarling synchronized php-1.5/includes/ChangeTags.php 'fixed change_tag index name'
10:08 Tim: merged r51871
10:08 logmsgbot: tstarling synchronized php-1.5/includes/specials/SpecialSearch.php
09:58 Tim: rebooting db17 via mgmt
09:57 logmsgbot: tstarling synchronized php-1.5/db.php 'depooling db17, went down due to scap'
09:49 Tim: srv159 swapdeath, rebooting using management interface
09:46 logmsgbot: tstarling synchronized php-1.5/includes/ImageFunctions.php 'disabled bad image list'
09:37 logmsgbot: tstarling synchronized php-1.5/includes/filerepo/FSRepo.php 'disabled fileExistsBatch'
09:32 Tim: s2 read/write
09:25 logmsgbot: tstarling synchronized php-1.5/db.php 'repooled db13, seems to have fixed itself'
09:19 Tim: done critical schema updates on db15
09:12 logmsgbot: tstarling synchronized php-1.5/includes/specials/SpecialRecentchanges.php
08:56 logmsgbot: tstarling synchronized php-1.5/db.php 'making db8 the fake master on s2'
08:44 logmsgbot: tstarling synchronized php-1.5/db.php
08:25 logmsgbot: tstarling synchronized php-1.5/includes/specials/SpecialRecentchanges.php
08:21 logmsgbot: tstarling synchronized php-1.5/db.php
08:20 logmsgbot: tstarling synchronized php-1.5/db.php
08:19 logmsgbot: tstarling synchronized php-1.5/db.php
08:19 Tim: switching s2 master from db15 to db13, db15 is missing schema updates
08:09 Tim: scap
07:13 Tim: restarted test.wikipedia.org
07:08 Tim: svn update to r51863
06:49 Tim: shut down test.wikipedia.org to avoid issues with the NFS copy
06:04 Tim: preparing for scap to ~r51860. Made backup in lazy-backups/php-1.5-2009-06-15. Will remove merged changes to bring php-1.5 back to near r48811 plus hacks, to reduce conflicts on svn up.
June 14
17:05 logmsgbot: andrew synchronized php-1.5/CommonSettings.php
June 13
22:13 logmsgbot: andrew synchronized php-1.5/CommonSettings.php
21:54 logmsgbot: kate synchronized php-1.5/db.php
21:54 river: taking db24 out of rotation to dump s2 for TS
05:07 logmsgbot: tstarling synchronized php-1.5/CommonSettings.php 'added wmfLoadInitialiseSettings definition'
05:06 logmsgbot: tstarling synchronized php-1.5/wgConf.php 'removed wmfLoadInitialiseSettings definition'
June 12
04:13 logmsgbot: tstarling synchronized php-1.5/wgConf.php 'updated for scaptrap r51333'
June 11
22:46 Fred: Kicking srv156 as it has gone unresponsive
16:06 Rob: rolled backups and upgrades to all corporate blogs (newsblog, whygive, & techblog). All upgrades test successful with no visible issues.
15:39 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Bug 18612'
14:41 Rob: updated dns for russian chapter url
14:36 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'bug 14731'
07:48 logmsgbot: jeluf synchronized php-1.5/InitialiseSettings.php '#19149 bgwiki autopatrol group'
06:14 Tim: added srv76 to mediawiki-installation and ran sync-common, was rogue
02:13 logmsgbot: tstarling synchronized php-1.5/db.php 'repooled db16, db13, db11'
June 10
18:59 logmsgbot: midom synchronized php-1.5/db.php 'reduced load on snapshot nodes'
12:02 Tim: master switches done, everything should be r/w. Doing schema changes now, toolserver needs to wait for these to be logged before it switches.
12:01 logmsgbot: tstarling synchronized php-1.5/db.php
11:59 logmsgbot: tstarling synchronized php-1.5/db.php
11:55 logmsgbot: tstarling synchronized php-1.5/db.php
11:53 logmsgbot: tstarling synchronized php-1.5/db.php
11:52 logmsgbot: tstarling synchronized php-1.5/db.php
11:51 logmsgbot: tstarling synchronized php-1.5/db.php
11:50 Tim: doing master switches, s1 -> db14, s2 -> db15, s3 -> db18
08:11 logmsgbot: tstarling synchronized php-1.5/CommonSettings.php 'added 1.15 to ExtensionDistributor'
07:58 logmsgbot: midom synchronized php-1.5/includes/Article.php
07:53 logmsgbot: midom synchronized php-1.5/includes/Article.php
07:50 logmsgbot: midom synchronized php-1.5/CommonSettings.php 'added one more log'
07:05 logmsgbot: midom synchronized php-1.5/includes/GlobalFunctions.php 'adding messages profiling hook, the usual one'
June 9
21:31 mark: Cleaned up stale route-maps on csw1-esams
21:08 mark: Removed unnecessary static routes to esams VLANs on br1-knams
21:01 mark: Migrated IPv4 traffic onto DF leg 1, altered static routes on br1-knams as well
20:20 logmsgbot: jeluf synchronized php-1.5/InitialiseSettings.php '#19138 - arwiki group settings'
20:19 mark: Set up IPv6 iBGP session between csw1-esams and br1-knams over DF leg 1
20:18 mark: Set up new link between csw1-esams and br1-knams over our first dark fiber leg
20:16 logmsgbot: jeluf synchronized php-1.5/InitialiseSettings.php '#19142 - fiwiki wgBlockAllowsUTEdit=true'
17:14 logmsgbot: tstarling synchronized php-1.5/db.php 'moved db18 back to regular s3'
15:50 logmsgbot: tstarling synchronized php-1.5/db.php 'repooled db1, db14, lomaria'
14:54 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'bug 18594'
14:54 logmsgbot: robh synchronized php-1.5/flaggedrevs.php 'bug 18594'
14:20 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '18905 Enable recent changes patrol on Bulgarian Wikipedia'
14:08 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '14276 Enable patrol function for non-sysops on Turkish Wikipedia'
14:04 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '18194 Enable NewUserMessage extension on Arabic Wikisource'
13:53 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Bug 18590'
13:48 logmsgbot: tstarling synchronized php-1.5/extensions/AbuseFilter/Views/AbuseFilterViewEdit.php 'fix bug 19135'
09:15 logmsgbot: tstarling synchronized php-1.5/db.php 'depooled db1, db14, lomaria'
09:02 Tim: installed ganglia on lomaria
08:56 logmsgbot: tstarling synchronized php-1.5/db.php 'moving db18 into temporary fr/ja role to replace db1'
08:52 Tim: installed ganglia on db17
08:42 logmsgbot: tstarling synchronized php-1.5/db.php 'reassigned db24 back to s2'
08:35 logmsgbot: tstarling synchronized php-1.5/db.php 'repooled the remainder of the current batch'
06:09 logmsgbot: tstarling synchronized php-1.5/db.php 'repooled thistle'
05:44 logmsgbot: tstarling synchronized php-1.5/db.php 'depooled ixia for schema updates'
04:51 logmsgbot: tstarling synchronized php-1.5/db.php 'depooled db22, db26, db30, thistle, db25, db29 for schema change, warming db24 for commons role'
June 8
21:44 Andrew: reports of missing column af_global reported by AbuseFilterViewEdit.php, in /h/w/logs/dberror.log. Used ddsh to check checksums for that file on all servers, no differences from the version under /h/w/c, which had no mention of the offending column.
16:48 logmsgbot: tstarling synchronized php-1.5/db.php
16:47 logmsgbot: tstarling synchronized php-1.5/db.php
16:19 logmsgbot: tstarling synchronized php-1.5/db.php 'repooled the current batch of servers. db24 is still lagged.'
16:07 logmsgbot: tstarling synchronized php-1.5/db.php
10:48 logmsgbot: tstarling synchronized php-1.5/db.php 'depooled db4, db7, db23, db24, db18, db21 for schema updates. db12 will do the enwiki query groups.'
10:38 logmsgbot: tstarling synchronized php-1.5/db.php 'repooled db12, db3, db8, db15'
10:11 logmsgbot: tstarling synchronized php-1.5/db.php 'repooled db5, db17'
10:01 Tim: adding AbuseFilter tables to all wikis that don't have them
09:59 logmsgbot: tstarling synchronized php-1.5/extensions/AbuseFilter/abusefilter.tables.sql
08:54 logmsgbot: tstarling synchronized php-1.5/db.php 'removed db12 from query groups (was 1%)'
07:27 Tim: depooled db12, db3, db8, db15, db17 for schema updates
07:27 logmsgbot: tstarling synchronized php-1.5/db.php
06:46 Tim: running DB updates on db5. All updates today are done by maintenance/update-2009-06-08.php running in a screen on zwinger
06:16 logmsgbot: tstarling synchronized php-1.5/maintenance/archives/patch-log_user_text.sql
06:12 logmsgbot: tstarling synchronized php-1.5/maintenance/archives/patch-log_user_text.sql
06:01 Tim: depooled db5 for schema update. Makes a good guinea pig since it has the lowest disk free space.
06:00 logmsgbot: tstarling synchronized php-1.5/db.php
05:51 logmsgbot: tstarling synchronized php-1.5/db.php 'repooled db29, depooled on Feb 2 by Domas "for some testing"'
05:43 logmsgbot: tstarling synchronized php-1.5/db.php
05:38 Tim: fixed nagios configuration, had many errors preventing sync
05:23 Tim: disk space critical on storage2. Deleted ~600 GB of files from 2008: all 2008 backups except those that come from wikis that are not in all.dblist
05:04 Tim: stopped ES slave on srv171, disk critical. ms2/ms3 have been reasonably stable replacements.
04:58 Tim: cleaned up binlogs on db13, was disk space critical
June 7
23:02 Andrew: Usability prototype wiki was insanely slow because it ran out of memory and swapped, and then ran out of swap. Looks to have been one rogue PHP process, which I killed. Restarted apache (it had been killed by the kernel to free memory).
13:26 JeLuF: PDF generation checked, seems to be working
13:24 JeLuF: PATH setting was missing in the startup script of mw-serve on pdf1. Added it in line 17.
13:16 JeLuF: pdf generation (Extension:Collection) is broken. Server restart didn't help
June 6
12:41 logmsgbot: jeluf synchronized php-1.5/InitialiseSettings.php '18397 itwiktionary suppressredirect permission'
12:21 logmsgbot: jeluf synchronized php-1.5/InitialiseSettings.php '19106 dawiki rollback permission'
June 5
20:11 logmsgbot: jeluf synchronized php-1.5/InitialiseSettings.php '18341 plwiktionary namespace alias'
20:05 logmsgbot: jeluf synchronized php-1.5/InitialiseSettings.php '18588 Fix pt namespace alias'
20:01 logmsgbot: jeluf synchronized php-1.5/InitialiseSettings.php '18985 wgBlockAllowsUTEdit for ptwiki'
19:51 logmsgbot: jeluf synchronized php-1.5/InitialiseSettings.php '(19041) NewUserMessage extension for rowiki'
19:48 logmsgbot: jeluf synchronized php-1.5/CommonSettings.php 'removed obsolete idwiki account creation throttle'
June 4
18:48 Rob: all active memcached servers now online
18:47 logmsgbot: robh synchronized php-1.5/mc-pmtpa.php 'decommissioned srv90 due to bad fans'
18:46 Rob: fixing memcached
18:43 Rob: pulled srv90 and srv67 for decommissioning
18:43 mark: Rebooting iris which has a bad disk
18:31 Rob: replaced disk in both sq26 and sq47
18:01 Rob: shutdown sq47 for bad disk
18:01 Rob: shutting down sq26 to replace bad disk
17:01 Fred: temporarily re-enabled deletion for OTRS while the Junk queue is getting cleaned.
13:09 mark: Moved back traffic to esams
June 3
20:20 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '17050 Change upload rights at Russian Wikipedia'
18:47 Rob: pushed update to planet for new inclusions (and removed some crap)
18:12 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'bug 18594'
17:46 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '18874 Add enwiki as import source at mediawikiwiki'
17:42 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Bug 18591'
16:53 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '18588 Create a namespace aliases on yuewiki'
16:28 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '18956 Import sources on el.wikiversity, forgot a source'
16:27 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '18956 Import sources on el.wikiversity'
16:11 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '17967 Alias of Wikibooks namespace in Chinese Wikibooks'
15:51 Rob: running initStats.php against commonswiki per bug 17802
15:46 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '16180 Upload and Transwiki settings for Japanese Wikiversity'
14:48 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '13853 Setup new groups in no.wikibooks'
14:29 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '18590 Add autoeditor group and remove autopromote on Ukrainian Wiktionary'
14:19 Rob: ran sync-common-all to update cluster, enabling flaggedrevs on eswikinews
06:37 logmsgbot: tstarling synchronized php-1.5/CommonSettings.php
06:36 logmsgbot: tstarling synchronized php-1.5/InitialiseSettings.php 'moved wgSkipSkins to InitialiseSettings.php and added vector'
June 2
05:00 logmsgbot: tstarling synchronized php-1.5/extensions/Collection/Collection.templates.php 'deployed r51327'
01:14 rainman_: ran salsa to update to latest search logging code
June 1
23:37 tomaszf: cleaning up space on storage2. once dumps are being cycled free space will come back
20:38 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '18876 Create new namespace for Korean Wiktionary'
20:21 logmsgbot: robh synchronized php-1.5/CommonSettings.php '16961 Activate watchcreations on Commons'
20:09 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '18369 Add an import source for de.wikisource'
19:59 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php
19:59 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '18437 Subpages not activated on enwiki ns 13'
19:59 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '18437 Subpages not activated on enwiki ns 13'
19:46 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '13451 Set new user groups for bswiki'
17:01 Rob: Fred pushed DNS to add bugzilla upgrade installation url
15:58 logmsgbot: robh synchronized php-1.5/flaggedrevs.php
15:46 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '19013 Request for Extension:Collection on hr Wikipedia'
15:31 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '18217 Additional namespace aliases for cuwiki'
15:13 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'bug 17730'
15:11 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '18329 New namespace alias in ca.wiki'
15:07 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '17694 Enable subpages on Lithuanian Wikipedia template namespace'
15:03 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '6633 Enable transwiki import for the Hebrew projects'
14:58 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '9907 Change Portal talk to Perbincangan Portal on mswiki'
14:50 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Bug 18865 Please, dissable local upload of files on es.wikibooks.'
May 31
16:05 logmsgbot: andrew synchronized php-1.5/includes/IP.php 'Live-merging r51236-7, fixes for IP::isInRange, which was broken.'
10:51 logmsgbot: andrew synchronized php-1.5/InitialiseSettings.php 'Activating AbuseFilter on ukwiki'
10:50 logmsgbot: andrew synchronized php-1.5/abusefilter.php 'Activating AbuseFilter on ukwiki'
May 30
20:14 logmsgbot: andrew synchronized php-1.5/CommonSettings.php 'Improve logging for hacked-in hook'
17:26 mark: Manually fixed up srv51, srv52, srv55
17:06 mark: There were two duplicate instances of pybal running on lvs3, killed both and restarted
17:03 mark: Puppetised srv90-99
16:53 mark: Puppetised srv80-89
16:45 mark: Puppetised srv71-79
16:23 mark: Puppetised srv61-srv70
16:08 mark: Puppetised srv51-60
16:04 mark: Installed srv57 and srv58 as application servers
15:53 mark: Installed srv56 as application server
15:42 mark: Installed srv42 as application server
15:36 mark: Rebooted srv42
15:30 mark: Fixed test.wikipedia.org, reinstalled wikimedia-nis-client
14:51 mark: Puppetised srv48-srv50
14:01 mark: Puppetised srv38-41
13:55 mark: Installed puppetd on all application servers
13:32 mark: Puppetised, dist-upgraded & rebooted srv37
13:25 mark: dist-upgraded & rebooted srv32, srv36
13:16 mark: Puppetised srv35 and srv36, dist-upgraded & rebooted srv35
12:28 mark: Repooled srv32-srv33, srv121-123
12:27 mark: Puppetised srv34
12:09 mark: Installed srv120 as appserver
May 29
23:27 tomaszf: restarting apache on hume for static.wikipedia.org to clean out old dead/lazy/useless workers
21:29 Rob: bad file permissions on morebots caused it to not restart. This explains the huge gap in log entries. We were making them in IRC and no bot was logging them
21:28 Rob: blah
21:25 Rob: damned morebots
00:10 logmsgbot: fvassard synchronized php-1.5/InitialiseSettings.php 'Overriding moving for users on usabilitywiki for sysop as well'
May 28
23:16 logmsgbot: fvassard synchronized php-1.5/InitialiseSettings.php 'Overriding moving for users on usabilitywiki (again, but with the right attribute this time)'
23:09 logmsgbot: fvassard synchronized php-1.5/InitialiseSettings.php 'Overriding moving for users on usabilitywiki'
22:04 rainman_: enabled lucene-search 2.1 on all wikis, still needs more tweaking to use available resources more efficiently, but leaving that for tomorrow
20:59 logmsgbot: fvassard synchronized php-1.5/CommonSettings.php 'Enabled PdfHandler on usability'
20:58 logmsgbot: fvassard synchronized php-1.5/InitialiseSettings.php 'Enabled PdfHandler on usability'
20:50 Fred: updated apache cluster with wikimedia-task-appserver 1.38
20:22 logmsgbot: andrew synchronized php-1.5/lucene.php 'Swapped out search11, replaced with search12 and added $wgLuceneSearchVersion = 2.1;. Rainman made me do it!'
19:53 logmsgbot: andrew synchronized php-1.5/lucene.php 'Reverted last change, rainman told me to do it'
19:45 logmsgbot: andrew synchronized php-1.5/lucene.php 'Swapped out search11, replaced with search12 and added $wgLuceneSearchVersion = 2.1;. Rainman made me do it!'
17:42 logmsgbot: andrew synchronized php-1.5/CommonSettings.php 'Logging for previously added hook for blocking mail spam'
17:20 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '18815 Create Portal namespace on Swahili Wikipedia'
17:12 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '16217 Official Korean name for Wikisource'
16:54 logmsgbot: andrew synchronized php-1.5/CommonSettings.php
15:08 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '18835 Activation of subpages feature on frwiki ns=15'
15:03 Rob: updated sync-common-all to mimic the logging of sync-file so I do not have to manually enter admin log entries for it anymore
14:59 Rob: ran updateAutoPromote and updateLinks in flaggedrevs maintenance for iawiki
14:58 Rob: ran sync-common-all to enable flaggedrevs on iawiki
14:28 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '18897 Change logo for Bulgarian Wikisource'
14:23 logmsgbot: andrew synchronized php-1.5/CommonSettings.php 'Hacked in a hook to stop some clown from email-spamming'
10:09 Tim: deploying r51105 to stop people from emailing via tor
10:08 logmsgbot: tstarling synchronized php-1.5/includes/specials/SpecialEmailuser.php
07:40 logmsgbot: midom synchronized php-1.5/db.php 'db26 going live'
00:31 rainman_: deployed udp search query logging on searchidx1 and search1-12
May 27
20:37 tomaszf: killing long running query. bugzilla is all well again
20:33 tomaszf: cleaning out numerous locks on db9 causing huge slowdown for bugzilla3 db
08:48 logmsgbot: andrew synchronized php-1.5/InitialiseSettings.php 'AbuseFilter on English Wikinews'
May 26
22:51 river: deleted some old snapshots on ms1
22:45 Fred: fixed dsh a tad, taking care of a few hosts whose keys have been changed and hosts missing keys...
22:28 Fred: removed stale amane NFS mount on all apache boxes.
22:17 Andrew: added 'Change Tagging' component to bugzilla
22:05 brion: adminned werdna on bugzilla to help w/ component config etc
18:17 rainman-sr: all of search_4 wikis temporarily moved to search12 server, preparing search11 for taking its place with full set of lucene-search 2.1 features
17:51 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '18589 Enable AbuseFilter on Cantonese Wikipedia'
17:51 logmsgbot: robh synchronized php-1.5/abusefilter.php 'updates on bug 18589 Enable AbuseFilter on Cantonese Wikipedia'
17:47 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '18938 Please activate Rollback on Simple.Wikibooks'
17:27 logmsgbot: andrew synchronized php-1.5/lucene.php 'Rainman made me do it! (taking search11 out of rotation)'
16:45 rainman-sr: running initial warmup and start of lsearchd on search12
16:38 Rob: search12 memory upgraded and rebooted
16:37 rainman-sr: running initial warmup and restart on search11
16:32 Rob: search11 ram upgraded and rebooted.
16:04 Rob: srv217 shutdown until I can poke it
16:01 logmsgbot: tstarling synchronized php-1.5/CommonSettings.php 'removed "xx" from $wgStyleVersion'
15:36 Rob: decommissioning out of warranty and broken servers: srv53, srv67, srv85, srv88, srv90.
15:22 Rob: db28 fan controller board replaced (not db26, typo) system is now online
15:21 Rob: db26 fan controller board replaced, system is now online
15:15 Rob: db26 memory replaced, shows all 32 GB now. Restarting
15:13 logmsgbot: andrew synchronized php-1.5/includes/Export.php 'Removing useless free()s breaking partial dumps'
14:20 Rob: shutdown mysql on db26 as it will have the memory replaced in approx. 25 minutes.
13:58 logmsgbot: tstarling synchronized php-1.5/skins/common/block.js
13:56 Tim: merged r50871
13:55 logmsgbot: tstarling synchronized php-1.5/includes/Block.php
08:14 Tim: srv78 had apache running but did not have /mnt/upload5 mounted. Upgraded its wikimedia-task-appserver package and mounted it.
07:23 Tim: set cr_status=new for all revisions with a recent status change by Skizzerz
May 25
21:35 mark: Modified scap scripts to work on /home-less apaches
21:17 logmsgbot: midom synchronized php-1.5/db.php 'replaced db16 with db12 for auxiliary roles'
18:51 mark: Restarted stuck apache on srv187
18:49 mark: db16 got in trouble at 18:35, spiking to 2000 threads. Might be related to the nv_nic_irq bug. It recovered 5 mins afterwards
17:33 mark: Increased COSS cache dirs from 10 GB to 15 GB on knsq16+ esams squids
07:08 domas: brought williams up, was down, console not responding.
May 23
01:07 tomaszf: restarted srv159 due to hard down
May 22
18:03 mark: Increased big object store on all upload squids from 15 GB to 20 GB
15:24 mark: Raised max in-memory object size to 100 kiB on all squids
15:11 mark: Raised cache dir size of large object store on knsq18 from 15 to 20 GB
14:58 mark: Increased cache dir size by 50% on knsq16, and upped max object size in memory from 75 to 100 kB followed by a backend squid restart
14:40 mark: Increased cache dir size by 50% on knsq16, and upped max object size in memory from 75 to 100 kB
10:41 mark: apt-get dist-upgrade on searchidx1
10:29 mark: Rebooting searchidx1
09:40 rainman-sr: zombie java process 1666 on searchidx1 locked in i/o and taking up lots of ram, cannot kill it, searchidx1 needs restart
May 21
19:09 Fred: restarted squid process on sq43 as it was not responding properly.
14:08 Tim: updated Switch master docs
13:15 domas: db12 RAID set to no-battery write-behind mode:
arcconf SETCACHE 1 LOGICALDRIVE 0 WB noprompt
13:06 Tim: master switch apparently worked perfectly, in and out of read only mode in like 15 seconds
12:54 logmsgbot: tstarling synchronized php-1.5/db.php
12:54 logmsgbot: tstarling synchronized php-1.5/db.php
12:53 logmsgbot: tstarling synchronized php-1.5/db.php 'setup'
12:50 Tim: attempting master switch from db12 to db16 (s1/enwiki) using new switch script
12:44 domas: db12 RAID controller needs battery replacement:
04:46 Tim: added some more singtel proxies to the XFF list
04:45 logmsgbot: tstarling synchronized php-1.5/extensions/TrustedXFF/trusted-xff.cdb
04:45 logmsgbot: tstarling synchronized php-1.5/extensions/TrustedXFF/trusted-hosts.txt
May 20
21:17 tomaszf: restarted wikitech db after InnoDB crash
21:12 Fred: updated all image scaler boxes (srv[43-47,100]) to wikimedia-task-scaler (1.6)
15:50 Fred: restarted apache on srv224 to bring the load back down from 10.
02:43 Tim: patched in r50175 to stop Special:RevisionDelete timing out on pages with lots of inbound links
02:40 logmsgbot: tstarling synchronized php-1.5/includes/specials/SpecialRevisiondelete.php
02:06 logmsgbot: tstarling synchronized php-1.5/includes/specials/SpecialRevisiondelete.php
May 19
23:30 mark: Created ldap/nis/puppet account for ariel
15:45 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'fixing abusefilter settings for zh_yuewiki'
13:36 Tim: restarted segfaulting apaches on srv212, srv222, srv184
06:56 Tim: set cr_status='new' for 1500 recent revisions in code_rev
May 18
23:20 logmsgbot: robh synchronized php-1.5/CommonSettings.php 'updating due to my split of abusefilter configuration into its own file'
23:20 logmsgbot: robh synchronized php-1.5/abusefilter.php 'splitting configurations into smaller specific files to save my sanity'
22:51 logmsgbot: robh synchronized php-1.5/CommonSettings.php '18589 Enable AbuseFilter on Cantonese Wikipedia'
22:51 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Bug 18589 Enable AbuseFilter on Cantonese Wikipedia'
22:23 logmsgbot: robh synchronized php-1.5/CommonSettings.php 'typos rock even more'
22:23 logmsgbot: robh synchronized php-1.5/CommonSettings.php 'typos rock'
22:21 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'abusefilter for nlwiki'
22:21 logmsgbot: robh synchronized php-1.5/CommonSettings.php 'abusefilter for nlwiki'
22:04 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'abusefilter nlwiki changes'
22:00 logmsgbot: fvassard synchronized php-1.5/InitialiseSettings.php 'Adding AppleTouch Icon for Usability'
21:49 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Adding more details for nlwiki abusefilter roles'
21:28 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Finish setup for nlwiki and nlwikibooks to have AbuseFilter'
21:06 logmsgbot: fvassard synchronized php-1.5/CommonSettings.php 'Enabling CommunityVoice extension for Usabilitywiki'
21:06 logmsgbot: fvassard synchronized php-1.5/InitialiseSettings.php 'Enabling CommunityVoice extension for Usabilitywiki'
18:54 Rob: replaced dead disks in db30 and db19
18:14 logmsgbot: midom synchronized php-1.5/db.php 'rob is a polar bear'
18:11 logmsgbot: midom synchronized php-1.5/db.php
17:17 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'abusefilter on nlwiki'
15:28 logmsgbot: midom synchronized php-1.5/mc-pmtpa.php
14:46 Rob: sync-common-all after bug 18421 Update config of FlaggedRevs for en.wikibooks
11:03 domas: showed pediapress people how to find large unlinked files :)
May 17
20:14 domas: cleaned up IPC semaphores, restarted apache on srv55
May 15
17:12 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'enabling abusefilter on nlwikibooks per bug 18615'
16:47 Rob: moved usabilitywiki upload from linode server to cluster, all files are now accessible
16:36 Fred: rebooting srv159 as it is hard down (or close to it)
15:30 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Bug 18781 Change SITENAME and add namespace alias on Ukrainian Wikiquote'
May 14
21:04 Rob: ran cleanupTitles.php against huwiki, mtwiki, & barwiki
19:56 Rob: reinstalling srv217 just in case its issues are software related (but prolly are not)
15:41 Rob: srv56 online
15:24 Rob: reinstalling srv56
15:15 Rob: db2 reinstalled
15:07 Rob: had to restart the wikitech server again
14:53 Rob: srv42 reinstall done, leaving setup for mark and puppetification
14:51 Rob: srv42 reinstalled, installing packages for apache use
14:31 Rob: taking down srv42 for reinstall
14:27 Rob: replaced disk in adler, reinstalling
11:07 logmsgbot: tstarling synchronized php-1.5/InitialiseSettings.php 'enabled wgBlockAllowsUTEdit on jawiki'
May 13
21:08 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Bug 17464 Add Portal namespace for arzwiki'
21:00 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Bug 16290 Creation of namespace Portal at bar.wikipedia.org'
20:28 Rob: took down srv217 cuz its fubar, will troubleshoot in DC tomorrow
19:36 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Bug 11112 Install DynamicPageList on Incubator'
19:23 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Bug 16254 enable for simple.wikiquote.org'
19:18 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Bug 16523 Creating portal namespace for et.wikipedia.org'
18:46 Rob: kicked srv217 back into service, with a note to test hardware later
18:28 mark: Ran apt-get dist-upgrade on srv143 and rebooted it, to get it back into shape
18:10 logmsgbot: fvassard synchronized php-1.5/InitialiseSettings.php 'Added Logo path definition for usabilitywiki'
17:51 Rob: pulled some apaches that were not getting config updates out of the cluster, the bad redirects should resolve now
17:45 Rob: pushed out the config for apache to the cluster, now checking to ensure any failed syncs are NOT pooled.
17:41 logmsgbot: fvassard synchronized php-1.5/InitialiseSettings.php 'Bug 11488 Fix namespace names in the Hungarian localization'
15:43 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Bug 18717 namespace addition for nowiktionary'
15:18 Rob: ran rebuildrecentchanges.php against wikimania2010wiki
15:17 Rob: exported wm2009 mediawiki namespace pages and imported them into wm2010 wiki per initial bugzilla request 18740
15:04 Rob: deleted the excess and unwanted imported pages on wm2010, thanks to Casey for compiling the list \o/
14:51 logmsgbot: tstarling synchronized php-1.5/InitialiseSettings.php 'Gave the accountcreator group on enwiki tboverride rights, so that they can create otherwise disallowed account names.'
14:47 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Bug 18643 Create Thesaurus namespace on Icelandic Wiktionary'
11:42 logmsgbot: midom synchronized php-1.5/db.php 'db26 serving as dump source'
11:40 logmsgbot: midom synchronized php-1.5/db.php
10:51 domas: same disks on db30 went offline again
May 12
20:46 mark: Rerouted traffic from pmtpa to esams via 6939 / 16150, by prepending 16265 3 times on csw1-esams
20:31 mark: Moved esams text LVS to mint
20:22 mark: Mark is making the network all awesome and flawless, for all my fans in #wikipedia-nl
20:20 mark: Brought BGP session to AS13680 back up, only made it worse
20:17 mark: Shutdown BGP session to 13680, we may be saturating that network and therefore experiencing packet loss
20:12 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'taking out some of the wikimania2010wiki settings that I am not sure about, to see if it's causing an issue.'
19:39 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'copying a bunch of settings from wikimania2009wiki to wikimania2010wiki'
19:36 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php
19:34 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php
19:32 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Logo update for iowiktionary'
19:30 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'temp update cuz i need to push an upload to a wiki that normally doesnt allow them.'
19:28 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'updated logo for ukwikiquote'
19:25 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'updated logo for ukwikibooks'
19:23 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'logo update for arzwiki'
19:10 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'mkwikisource logo'
18:57 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'I think this should turn on abuse filter for usability wiki...'
18:52 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Bug 18517 Enable the Collection extension for creating books on the Alemannic Wikipedia [alswiki]'
18:37 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Bug 18430 Enable Collection extension on id.wikipedia.org'
17:09 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php
17:08 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Rob needs to pay closer attention when he is frustrated =P'
17:03 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'forgot to set the server for wikimania2010'
16:58 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Turned on PDF collection on ptwikiversity'
16:52 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'for wikimania2010 wiki'
16:19 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php
16:18 Rob: depooled puppet apaches so i can make site changes
16:05 Rob: messing with configs due to puppet and such
13:48 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Bug 18519 Please enable transwiki-imports to English Wiktionary from other Wiktionaries.'
01:02 tomaszf: restarting mysqld on db9 to get rid of ram disk
May 11
21:40 mark: Switched srv32, srv33 puppetmaster to sockpuppet
21:13 mark: Puppetified srv121, srv122 and srv123 and installed them as app servers
20:40 brion: making a note for the record that db7 (enwiki watchlist) is lagging sometimes. it's under extra load pulling a dump
06:46 logmsgbot: tstarling synchronized php-1.5/includes/specials/SpecialUndelete.php 'deploying r50470 to fix bug 18726 (double URL escaping)'
04:12 Tim: starting recompressTracked for all wikis on hume
03:57 Tim: stopped old ES slaves on srv172, srv173, srv184, srv185
03:53 Tim: cleaned up relay logs on db25
May 10
14:06 domas: configured pageview count aggregation on locke, shipment not done yet
May 9
13:30 brion: rebooting usability.wikimedia.org, it got more thoroughly stuck while attempting to restart apache
13:20 brion: poking usability.wikimedia.org ...
04:15 Andrew: usability.wikimedia.org seems to be down
May 8
20:32 logmsgbot: tfinc synchronized php-1.5/extensions/WikimediaMessages/WikimediaMessages.i18n.php
18:59 logmsgbot: tfinc synchronized php-1.5/extensions/WikimediaMessages/WikimediaMessages.i18n.php
08:54 Tim: amane's root partition filled up due to the cp running in a root screen, copying from NFS to an unmounted mount point /mnt/big-disk. Moving some stuff to the real mount point /mnt/scratch (in another screen)
08:25 Tim: deploying r48837 and r48911 to fix bug 18171 (broken oldimage parameter)
06:11 logmsgbot: midom synchronized php-1.5/db.php 'got to get coffee'
05:56 logmsgbot: midom synchronized php-1.5/db.php
05:55 logmsgbot: midom synchronized php-1.5/db.php
00:08 brion: db12 no longer overloaded with 'too many connections'. very mysterious
00:04 brion: db12
00:04 brion: db errs on en. poking...
May 7
21:36 domas: db30 disks are shown online, array degraded after 'arcconf rescan', not sure what that means
21:33 domas: db19 disk error counts:
(one disk just failed few times entirely, other gets lots of aborts/medium errors, might be related)
21:21 domas: db30.mgmt needs reset (facilitated by physical movements of power cord)
21:03 domas: db30 has _second_ disk death
21:00 domas: db28 FUBAR information:
20:46 domas: db28 fb0.fm1.f1.speed is flapping between 0 and 21100. needs datacenter inspection and/or vendor service.
19:50 domas: db19 has corrupted ibdata, depooling
19:47 domas: bad disk on db19 actually made I/Os time out, thus corrupting relay logs, reset slave seems to have helped.
19:36 logmsgbot: midom synchronized php-1.5/db.php 'db25 needs some load'
18:46 domas: db19 drive failed, needs replacement (you hear, Rob?! :)
17:35 domas: added retry=1 to ProxyPass for secure.wikimedia apaches backend
16:56 domas: enabling mod_deflate (bottom of main.conf) on apaches
16:50 domas: added new singtel subnet to trusted xff
16:50 logmsgbot: midom synchronized php-1.5/extensions/TrustedXFF/trusted-xff.cdb
07:45 domas: reset slave on db18
02:40 david: added localhost IPs (v4 and v6) to relay_from_hosts in exim4.conf on grosley
02:40 david: Truncated /var/log/exim4/paniclog on grosley, which had an old configuration syntax error notice in it
00:00 brion: added an 'editor' group to wikitech so we don't have to make all users sysops to edit until we get round to culling the abuse accounts :)
May 6
18:48 Fred: made a couple of changes to the Bayes processing scripts so that they support people moving the Bayes folders around. Wikitech updated.
May 5
21:41 mark: Started Apache on srv32/srv33
19:11 mark: Stopping Squid processes on yaseo servers
13:58 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Bug 18618 enabling collections on a couple of wikis'
00:34 tomaszf: upping srv225 to 8 active running dumps
May 4
18:43 tomaszf: spawning 5 extra dumps from srv225 to see throughput of system
18:43 tomaszf: depooling srv225 from apache node list and adding it as a dumps workers box
10:01 Tim: set max_connections on ms2 to 2000 using SET GLOBAL, to match the value in /etc/mysql/my.cnf
May 3
08:26 domas: synced to unbreak Special:Log
08:24 logmsgbot: midom synchronized php-1.5/includes/specials/SpecialLog.php -
May 2
19:16 rainman___: search3 mysteriously rebooted around midnight, starting lsearchd on it now
05:45 Andrew: reports that Special:Log is broken because it's hitting MAX_JOIN_SIZE
00:16 tomaszf: forcing 644 on dumps using 7za on srv31 until ubuntu bug # 370618 is resolved
May 1
19:43 mark: Unshut peering with AS 2529 on br1-knams
19:11 tomaszf: added php normalize library to srv31. running a couple batch dumps to test functionality.
19:00 tomasz: xml backup jobs went haywire last night due to missing libs. time to fix ..
17:27 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'updated for usability wiki'
16:49 Rob: upgraded limesurvey to newest version
16:04 logmsgbot: andrew synchronized php-1.5/InitialiseSettings.php 'Allow bureaucrats to set/unset inactive right on private wikis -- private request from cary'
13:14 mark: Installed server sockpuppet
12:59 logmsgbot: andrew synchronized php-1.5/includes/ChangeTags.php 'Live-merged r50104 -- escaping for classes applied to change tags'
05:47 logmsgbot: tstarling synchronized php-1.5/includes/UserMailer.php 'merged r49682'
05:36 logmsgbot: andrew synchronized php-1.5/InitialiseSettings.php 'Re-activated tag filtering with the live patch to ChangeTags.php'
05:32 logmsgbot: andrew synchronized php-1.5/includes/ChangeTags.php 'Index has been renamed since it was created on Wikimedia'
05:13 logmsgbot: tstarling synchronized php-1.5/InitialiseSettings.php
05:10 Tim: $wgUseTagFilter=true experimentally
05:10 logmsgbot: tstarling synchronized php-1.5/InitialiseSettings.php
04:54 logmsgbot: tstarling synchronized php-1.5/includes/ChangeTags.php
04:53 logmsgbot: tstarling synchronized php-1.5/includes/ChangeTags.php
04:50 Tim: merging r49068 and r49086
April 30
17:10 Rob: srv143 back online
17:07 Rob: all memcached back online
17:07 logmsgbot: robh synchronized php-1.5/mc-pmtpa.php 'swapped out srv142'
17:06 Rob: srv143 locked up, restarting
17:05 Rob: srv142 reinstalling
16:52 Rob: srv31 setup and good to go back to tomasz
16:48 Rob: srv31 reinstalled, installing wikimedia-task-appserver package but NOT pooling.
16:39 Rob: srv81 back online
16:25 Rob: upgrading srv31 to ubuntu
16:10 Rob: reinstalling srv81
16:08 Rob: srv130 back online
15:57 domas: db30 has drive failure, needs replacement
15:41 Rob: upgrading srv124 to ubuntu
15:30 Rob: srv127 was readonly, restarted, fsck, back online
15:25 Rob: upgrading srv137 to ubuntu
13:29 river: upgraded ms4/ms6 to solaris 10 update 7
02:34 Tim: reset slave on db3
02:28 Tim: updated /root/.ssh/authorized_keys on all machines identified with a pingscan that allowed a login with nagios's key. Revoked access for nagios, jeronim and kyle.
April 29
21:32 logmsgbot: brion synchronized php-1.5/includes/specials/SpecialExport.php 'merging r50054 fix for recursive depth export'
21:23 Rob: ran namespaceDupes script against mtwiki once the new portal namespaces were created.
21:22 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Bug 18498, adding portal and portal talk namespaces'
21:13 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Bug 18498, adding metanamespace_talk for mtwiki'
21:12 brion: set up system administrators global group with export depth override right so Trevor can test the batch export
20:49 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Bug 18237 enable autopatrolling and improve patrolling user rights on itwiktionary'
19:05 Rob: DHCP services stopped on zwinger and started on khaldun. Khaldun is now the dhcp server as well as the installation server.
14:53 Rob: restarted wikitech and manually ran morebots upon reboot.
04:07 Tim: doing some network scanning to make sure our host lists are up to date
02:15 tomaszf: moving rsync test to ms1 per Tim.
02:36 Tim: removed all remaining obsolete by_ssh* checks from the nagios configuration
02:27 Tim: installed NRPE on amane and adjusted nagios configurator
01:54 tomaszf: testing commons upload of top level storage directory on zwinger to offsite backup.
01:38 Tim: fixed the mediawiki installation on amane: installed wikimedia-task-appserver, disabled apache, ran sync-common, added to ganglia
April 28
18:02 Rob: futzing around with moving dhcp, taking srv209 as my guinea pig.
10:58 Tim: re-added srv31 to mediawiki-installation node group, backup task was rogue and generating "missing cluster" exceptions
10:21 logmsgbot: tstarling synchronized php-1.5/includes/ExternalStoreDB.php
10:19 Tim: re-added srv57 to mediawiki-installation, was rogue and causing "unknown cluster" errors
07:59 logmsgbot: tstarling synchronized php-1.5/db.php 'set the new cluster22 to be the sole ES write destination'
07:57 Tim: pdns on bayle is broken, stuck in futex, restarting
07:52 logmsgbot: tstarling synchronized php-1.5/db.php
07:49 logmsgbot: tstarling synchronized php-1.5/db.php 'introducing cluster22 (ms3/ms2)'
07:43 Tim: adding tables called blobs_cluster22 to ms3, for new current text cluster
07:30 Tim: fixed /etc/mysql/debian.cnf on ms3 so that logrotate flush logs can work
02:09 logmsgbot: andrew synchronized php-1.5/CommonSettings.php 'Rolling out tor changes'
02:07 logmsgbot: andrew synchronized php-1.5/InitialiseSettings.php 'Rolling out tor changes, and ipblock-exempt on all wikis'
01:48 Andrew: Updating configuration to change tor settings.
April 27
23:42 logmsgbot: tstarling synchronized php-1.5/db.php 'gave the current ES masters some read load'
23:05 Tim: increased connection limit on temp-es* from 100 to 500
18:31 Rob: srv138, srv139, & srv145 reinstalled and online.
18:24 brion: stopped apache and umounted amane from srv184 (ES slave). load is way too high for some reason on this box
18:24 Rob: removed amane from mounts on srv184
18:01 Rob: srv145 reinstalling
17:58 Rob: some quirky stuff going on from various memcached hosts being reinstalled and such. Issues seem to be resolved now.
17:55 logmsgbot: robh synchronized php-1.5/mc-pmtpa.php 'removing reinstalling servers'
17:54 logmsgbot: robh synchronized php-1.5/mc-pmtpa.php 'removing reinstalling servers'
17:43 Rob: srv129 back online
17:43 Rob: reinstalling srv138 and srv139
17:24 Rob: srv126 up and online
17:11 Rob: srv126 and srv129 being reinstalled.
17:09 Rob: srv86 and srv87 up and online
16:49 Rob: srv86 and srv87 upgrading to ubuntu
16:42 Rob: srv107 online
16:38 logmsgbot: robh synchronized php-1.5/mc-pmtpa.php 'Removing srv120-srv123 for other testing'
16:35 logmsgbot: robh synchronized php-1.5/mc-pmtpa.php 'removing srv156'
16:22 Rob: srv120-srv123 reinstalled, NOT online. Base OS, nothing else, passed on to mark for his testing. (Puppet I assume.)
15:48 Rob: srv120-123 going down for reinstallation
15:45 Rob: srv108 and srv109 up and online
15:06 Rob: srv108 and srv109 are in mid-install for ubuntu
15:06 Rob: srv107 won't restart for some reason, adding to tasks to troubleshoot.
15:04 Rob: srv105 and srv106 back up and online
14:56 Rob: srv107-srv109 going down
14:54 Rob: srv104 back online
14:48 Rob: srv102 and srv103 back up and online
14:43 Rob: srv102-106 reinstalling.
14:29 Rob: srv53 has a bad fan, shutting down until its replaced.
14:20 Rob: srv102-srv109 being upgraded to ubuntu.
11:42 logmsgbot: andrew synchronized php-1.5/InitialiseSettings.php 'Updated $wgSitename for ukwikimedia in accordance with IRC request from Michael Peel, a board member'
02:20 Tim: srv53 down, took it out of memcached rotation. Updating the memcached spare list.
02:20 logmsgbot: tstarling synchronized php-1.5/mc-pmtpa.php
02:12 Tim: fixed rc1 slaves, broken by expire_logs_days on ms3
01:59 Tim: Shut down srv217 for maintenance. Similar timer interrupt issue observed as before: select() syscalls running indefinitely despite a short timeout specified.
01:53 logmsgbot: tstarling synchronized php-1.5/db.php
01:52 Tim: repooled ms3 rc1 instance
01:49 Tim: reset slave on db21, was running out of disk space due to relay logs
01:42 Tim: fixed nagios for srv99, still had its apache check command set to my CGI security vulnerability demonstration, permanently saved in retention.dat despite config changes
01:17 Tim: enabled apport on srv99, to see if I can track down the nagios flapping
00:52 Tim: restarted trackBlobs.php
April 25
23:31 Tim-away: experimentally stopping replication on db3 to check disk load
22:51 logmsgbot: tstarling synchronized php-1.5/db.php 'reduced load on db3'
18:50 mark: Killed long-running SQL query TrackBlobs::trackRevisions query from hume causing db3 to lag heavily
17:22 mark: Stopped Apaches on srv32/srv33 again, as syncs will fail in most cases
16:36 mark: Started /home-less apache on srv33
13:23 mark: Started /home-less apache on srv32
11:03 mark: Kicked srv99 back into submission
10:56 mark: Squid-blocked high-rate scraper which was overloading ES
05:30 Tim-away: fixed conflict markers in extensions/CentralNotice/SpecialNoticeText.php and resynced.
05:30 logmsgbot: tstarling synchronized php-1.5/extensions/CentralNotice/SpecialNoticeText.php
April 24
22:23 rainman__: search back up on all wikis
22:17 logmsgbot: root synchronized php-1.5/lucene.php 'Replacement for reinstalled srv58'
22:15 logmsgbot: brion synchronized php-1.5/secure.php 'fix for thumbs on private ssl access (bug 18475 etc)'
21:19 rainman_: srv58 dead, breaking search on all non-major wikis, transferring the service to search11/12....
19:50 Rob: srv90-srv99 ganglia installed.
19:50 Rob: srv97 online
19:47 Rob: srv98 online
19:46 Rob: srv96 online
19:45 Rob: srv99 online
19:42 Rob: srv95 online
19:40 Rob: srv92, srv93, and srv94 back online
19:39 Rob: srv91 back online
19:24 Rob: srv90 online
19:16 Rob: srv90-srv99 reinstalled, currently looping through package installation
18:34 mark: Fixed ganglia by installing the appropriate config files on the (reinstalled) aggregation hosts
18:27 Rob: installed ganglia on all servers reinstalled to ubuntu apache thus far today.
18:27 Rob: srv89 back online
18:17 Rob: srv90-srv99 will be down over the next 30 minutes for ubuntufication.
18:16 logmsgbot: robh synchronized php-1.5/mc-pmtpa.php 'some spares were actually down'
18:14 logmsgbot: robh synchronized php-1.5/mc-pmtpa.php 'removed the 9x servers for reinstallation'
18:02 Rob: srv84 ubuntufied and online
17:58 Rob: srv83 ubuntufied and online
17:54 Rob: srv82 ubuntufied and online
17:50 Rob: srv81 reinstalled and online
17:47 Rob: srv89 coming down for reinstall
17:44 Rob: srv58 online
17:38 Rob: srv57 online
17:26 Rob: reinstalling srv58
17:16 mark: Set up switchport for srv57 on asw-c4-pmtpa
17:10 Rob: reinstalling srv57
17:08 Rob: srv75 back online
17:02 Rob: srv74 back online
16:47 Rob: srv73 back online as apache
16:45 logmsgbot: robh synchronized php-1.5/mc-pmtpa.php 'removed srv75'
16:43 Rob: srv71, srv72 back online as apache
16:42 Rob: taking down srv75 for reinstall
16:37 logmsgbot: fvassard synchronized php-1.5/mc-pmtpa.php 'swapping out srv72 for srv100 and srv73 for srv101 while srv[72,73] are being ubuntified'
16:26 Rob: srv72, srv73, and srv74 down for reinstallation
16:23 logmsgbot: root synchronized php-1.5/mc-pmtpa.php 'swapping out srv71 for srv70 and srv74 for srv92 while srv[71,74] are being ubuntified'
16:05 Rob: srv34 back online reinstalled as ubuntu
16:04 Rob: reinstalling srv71
16:04 Fred: restarted apache on srv99
15:21 Rob: srv34 coming down for reinstall
15:13 Rob: amane reinstalled for tomasz
14:59 Rob: amane reinstall started
14:36 rainman-sr: search9,10 also up; everything should be normal again
14:33 Rob: amane shutting down for raid controller work
14:27 rainman-sr: search5-8 back in search pool
14:16 Rob: shutting down search9 & search10 for memory upgrade
14:15 Rob: search7 & search8 memory upgraded, systems rebooted
14:07 Rob: search5 and search6 back online.
14:05 Rob: memory upgrade complete on search5 & search6, rebooted.
14:02 rainman-sr: done with initial index warmup on search3,4, back in rotation
13:59 Rob: search5, search6 shutdown for memory upgrade
13:58 Rob: search4 memory upgraded and system back online
13:55 Rob: search3 ram upgraded and system is back online
13:50 Rob: search3 upgraded, rebooting.
13:44 Rob: shutdown search3 & search4 for memory upgrades
07:18 logmsgbot: tstarling synchronized php-1.5/db.php
03:35 logmsgbot: andrew synchronized php-1.5/InitialiseSettings.php
03:35 Andrew: Deployed AbuseFilter to fiwiki
02:51 logmsgbot: tstarling synchronized php-1.5/mc-pmtpa.php
02:46 Tim: srv127 has corrupted root partition, needs reinstall or repair. Shut down with echo o > /proc/sysrq-trigger.
02:36 logmsgbot: tstarling synchronized php-1.5/mc-pmtpa.php
02:31 Tim: killed srv124 with /proc/sysrq-trigger. Was very slow on ssh and was giving odd 403 errors via HTTP.
02:21 logmsgbot: tstarling synchronized php-1.5/README
02:12 Andrew: Updated ruwiki abuse filter configuration per bugzilla request.
02:12 logmsgbot: andrew synchronized php-1.5/CommonSettings.php
02:10 Andrew: srv127: rsync: mkstemp "/apache/common/php-1.5/.CommonSettings.php.TRNqkG" failed: Read-only file system (30)
01:15 logmsgbot: tstarling synchronized php-1.5/db.php
01:14 Tim: depooled db3 so that it can finish doing the querycache update without making lots of people wait for a MASTER_POS_WAIT
01:03 logmsgbot: tstarling synchronized php-1.5/InitialiseSettings.php
01:03 Tim: blacklisted Wantedtemplates on enwiki, has been running for more than a day.
00:54 Tim: restarting trackBlobs.php on hume for afwiki and enwiki
April 23
19:05 brion: donate.wikipedia.org redirect borked, going to civicrm instead of public donation pages. server config needs updating
16:54 brion: db3 was lagging a bit; 403s a few minutes ago. catching up nicely now
Note this is from Wantedtemplates recache job
14:46 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Added namespaces to huwikisource per bug 18557'
14:41 logmsgbot: tstarling synchronized php-1.5/includes/specials/SpecialUpload.php
14:39 Tim: merged r49775
14:32 logmsgbot: tstarling synchronized php-1.5/includes/specials/SpecialUpload.php
14:31 Tim: merged r49051
14:13 Tim: fixed nagios labels for esams backup ext store, erroneously labelled as "toolserver"
06:27 Tim: restarted all job runners, ES connection errors weren't killing them
05:43 Tim: shutting down mysql on all fedora ES servers. Will update documentation and node lists to indicate that this is permanent.
05:37 Tim: srv217 did not come up from a soft reboot, but power cycle worked. Before reboot, observed apache2 hanging indefinitely on nanosleep(), but couldn't reproduce a timer issue in other processes. An NFS mount was hanging on stat.
05:13 Tim: rebooting srv217
04:41 Tim: srv217 is hanging on various operations, investigating. Trying to shut down its apache.
04:35 logmsgbot: tstarling synchronized php-1.5/db.php
04:31 Tim: copy done, started cluster18 mysql instance on ms3 using srv104 snapshot, repooled it
02:07 logmsgbot: tstarling synchronized php-1.5/InitialiseSettings.php
01:57 Tim: relaxed wgAccountCreationThrottle on frwiki, presumably the 2006 vandal emergency is over. Disabled it on idwiki for workshop event.
01:45 Tim: copying srv104's data from ms3 to ms2
01:11 Tim: started mysql on srv104
April 22
21:44 tomaszf: db9 is back up. excessive tmpfs file systems removed
21:39 tomaszf: taking outage on db9 to remove tmpfs file systems
11:34 JeLuF: initiated reboot of srv137. dmesg shows no usable information any more.
11:30 JeLuF: srv137 has read-only filesystem. Stopped Apache.
06:03 logmsgbot: andrew synchronized php-1.5/includes/specials/SpecialBlockip.php 'Live-merged r49730, typo causing failures in user hiding'
06:02 Andrew: srv137 still seems read-only, srv137: rsync: mkstemp "/apache/common/php-1.5/includes/specials/.SpecialBlockip.php.1QkrKX" failed: Read-only file system (30)
03:14 Tim: copying ES data from srv104 to ms3 using nc tarpipe
03:10 logmsgbot: tstarling synchronized php-1.5/db.php 'depooling srv104 ES'
03:03 Tim: corruption found on cluster18, the copy source server (srv106) is missing lots of rows. Switched back to srv105/104.
03:02 logmsgbot: tstarling synchronized php-1.5/db.php
02:50 logmsgbot: tstarling synchronized php-1.5/includes/Revision.php 'reverted profiling and logging hacks'
02:40 Tim: depooled ms2 ex-fedora instances and shut them down, it can be a backup for now
02:38 logmsgbot: tstarling synchronized php-1.5/db.php
02:33 Tim: deployed the new ms2/ms3 ex-fedora ES configuration
02:32 logmsgbot: tstarling synchronized php-1.5/db.php
02:04 tomaszf: updated CentralNotice to skip over bad messages when generating js.
02:01 Tim: set up ex-fedora mysql instances on both ms2 and ms3, controlled with /etc/init.d/mysql-ex-fedora
01:04 Tim: changed the main mysql instance on ms3 (rc1) to bind to a single IP address instead of *
April 21
19:41 mark: Added grosley.wikimedia.org to local_domains list on grosley's exim.conf, and added appropriate aliases in /etc/aliases
16:35 Andrew: Re-ran rebuildTemplates.php, all seems well now
16:30 logmsgbot: robh synchronized php-1.5/mc-pmtpa.php 'syncing for fred'
16:30 logmsgbot: root synchronized php-1.5/mc-pmtpa.php 'swapping out srv88 for srv159 and srv90 for srv198'
16:29 logmsgbot: andrew synchronized php-1.5/mc-pmtpa.php 'Switched srv88 for srv159, srv90 for srv198 to fix down memcache nodes'
16:18 azafred: restarted memcached on srv96. Now responding.
16:14 Rob: Fred needs to start logging in as Fred and not as root, bad fred (see it wasnt me this time, bwahahahahahaa)
16:11 Andrew: Fred fixed up some memcached nodes, but no joy with rebuildTemplates
16:10 logmsgbot: root synchronized php-1.5/mc-pmtpa.php 'swapping out down servers for active ones'
16:09 logmsgbot: root synchronized php-1.5/mc-pmtpa.php 'swapping out down servers for active ones'
16:01 Rob: srv137 read only, depooled in pybal for apache and rebooting.
15:57 logmsgbot: root synchronized php-1.5/mc-pmtpa.php 'swapping out down servers for active ones'
14:34 Andrew: rebuildTemplates.php appeared not to help, same problem as before (stopped after a few wikis). Possibly a dodgy memcache node.
14:32 Andrew: ran rebuildTemplates.php metawiki due to reports of appearing in place of the central notice.
05:04 Andrew: Live-merged r49685, fix for unsuppression of usernames on unblock -- some usernames were left stuck suppressed if they were unblocked when the block suppressed their username
05:03 logmsgbot: andrew synchronized php-1.5/includes/specials/SpecialBlockip.php
05:03 logmsgbot: andrew synchronized php-1.5/includes/specials/SpecialIpblocklist.php
01:34 azafred: Made some improvements to spam handling. Bayes is in play and can learn from everybody what is spam and what is ham. Documentation to follow.
April 20
19:59 Rob: Powering down srv67, srv85, srv88, srv90 due to temp warnings and bad fans.
19:36 Rob: updated mc-pmtpa.php to reflect the status of down or spare for the memcached servers. (lots more spares now)
17:35 azafred: restarted apache on srv217
17:34 azafred: srv125 reinstall completed.
17:24 Rob: srv146 back online
17:10 Rob: srv131 back up, updated and synced.
16:52 azafred: srv118 reinstall completed.
16:52 Rob: srv127 back online and synced.
16:41 Rob: srv125 reinstalled, passing off to fred
16:40 Rob: replaced dead disk in sq26
16:31 Rob: shutting down sq26 to replace bad hdd
16:27 Rob: reinstalling srv125
16:13 azafred: finished re-install of srv63.
16:11 Rob: reinstalled srv118, handed off to fred for completion
16:01 Rob: restarted srv118 and reinstalled it
15:57 Rob: restarted a locked up srv110 and synced it.
15:49 Rob: srv81 locked up, fixed, synced and online
15:29 Rob: replaced fan and drive in srv63, reinstalling
14:36 Rob: memory replaced in srv203, back online.
14:11 Rob: shutting down srv203 to swap out bad memory
05:12 Tim: fixed memcached on srv75, stopped old ES slave on srv102, srv106, srv107, srv159, srv171
April 18
14:05 Tim: unblocked 80legs, they promised to be nice
13:56 logmsgbot: tstarling synchronized robots.txt
05:26 azafred: rebooted db20 after / ran out of space and started causing all kinds of issues.
April 17
22:49 brion: regenerated centralnotice output again... this time ok
22:48 brion: srv93 and srv107 memcached nodes are running but broken. restarting them...
22:43 brion: restarted srv82 memcache node. attempting to rebuild centralnotices...
22:41 brion: bad memcached node srv82
22:05 mark: Set up 3 new pywikipedia mailing lists, redirected svn commit output to one of them
19:38 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Bug 18494 Logo for ln.wiki'
17:22 Rob: removed wikimedia.se from our nameservers as they are using their own.
16:48 azafred: updated spamassassin rules on lily to include the SARE rules and mirror the settings on McHenry.
10:25 logmsgbot: tstarling synchronized robots.txt
08:19 logmsgbot: tstarling synchronized php-1.5/InitialiseSettings.php
07:13 Tim: temporarily killed apache on overloaded ES masters
07:11 logmsgbot: tstarling synchronized php-1.5/db.php 'zeroing read load on ES masters'
06:04 Tim: brief site-wide outage while it rebooted, reason unknown. All good now. Resuming logrotate.
05:55 Tim: db20 h/w reboot
05:48 Tim: shutting down daemons on db20 for pre-emptive reboot. Serial console shows "BUG: soft lockup - CPU#4 stuck for 11s! [rsync:27854]" etc.
05:10 Tim: on db20: killed logrotate -f half done due to alarming kswapd CPU (linked to deadlocked rsync processes). May need a reboot.
05:00 Tim: fixed logrotate on db20, broken since March 10 due to broken status file, most likely due to non-ASCII filenames generated by demux.py. Patched demux.py. Removed everything.log.
02:14 river: set up ms6.esams, copying /export/upload from ms1
00:24 Tim: blocked lots of uci.edu IPs that were collectively doing 20 req/s of expensive API queries, overloading ES
00:15 brion: techblog post on Phorm opt-out is linked from slashdot; load on singer seems fairly stable.
April 16
23:06 logmsgbot: tfinc synchronized php-1.5/extensions/ContributionReporting/ContributionHistory_body.php
22:48 azafred: bounced apache on srv217. All threads were DED - dead
22:16 logmsgbot: tfinc synchronized php-1.5/extensions/ContributionReporting/ContributionHistory_body.php
22:08 logmsgbot: tfinc synchronized php-1.5/extensions/ContributionReporting/ContributionHistory_body.php
17:41 domas: fantastic. I start _looking_ at stuff and it fixes itself.
17:35 logmsgbot: midom synchronized php-1.5/includes/Revision.php 'live profiling hook'
17:28 domas: db20 has kswapd deadlock, needs reboot soonish
17:18 logmsgbot: midom synchronized php-1.5/InitialiseSettings.php 'disabled stats'
17:15 logmsgbot: midom synchronized php-1.5/InitialiseSettings.php 'enabling udp stats'
16:18 azafred: bounced apache on srv217 (no pid file so previous restart did not include this one)
15:57 brion: network borkage between Florida and Amsterdam. Visitors through AMS proxies can't reach sites.
15:55 azafred: bounced apache on srv[73,86,88,93,108,114,139,141,154,181,194,204,213,99]
15:52 Tim-away: started mysqld on srv98,srv122,srv124,srv142,srv106,srv107: done with them for now. srv102 still going.
15:30 mark: Set up ms6 with SP management at ms6.ipmi.esams.wikimedia.org
14:13 mark: Restoring traffic to Amsterdam cluster
14:06 mark: Reloading csw1-esams
13:55 mark: Reloading csw1-esams
13:53 JeLuF: ms1 NFS issues again. Might be load related
13:49 Tim: copying fedora ES data from ms3 to ms2
13:44 JeLuF: ms1 is reachable, no errors logged, NFS daemons running fine. After some minutes, NFS clients were able to access the server again. Root cause unknown.
13:38 JeLuF: ms1 issues. On NFS slaves: "ls: cannot access /mnt/upload5/: Input/output error"
13:24 mark: DNS scenario knams-down for upcoming core switch reboot
08:23 river: pdns on bayle crashed, bindbackend parser seems rather fragile
03:01 logmsgbot: andrew synchronized php-1.5/InitialiseSettings.php 'Deployed AbuseFilter to ptwiki'
April 15
22:42 tomaszf: adding ramdisk to db9 to speed up create tmp tables
22:34 mark: PowerDNS got confused by a commented DNS entry and broke zone wikimedia.org, fixed
22:32 brion-codereview: DNS broken. mark's poking it
22:24 mark: Temporarily removed AAAA record from mayflower in DNS
22:14 brion-codereview: db9 tmpfs full, breaking anything using that db
22:00 brion-codereview: ipv6 connectivity broken between isidore & mayflower, breaking codereview SVN updates
20:59 brion: civicrm queries bogging down db9 affecting otrs performance. tom's looking into it
18:24 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'for subpages on ukwikimedia'
17:32 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Bug 17898 Wiktionary is a bad interwiki prefix on ukwiktionary and mlwiktionary'
17:25 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'per bug 17773 Install Labeled Section Transclusion for dewikiversity'
14:33 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Bug 17718 Disable CentralNotice on private/fishbowl wikis'
14:29 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '18434 Enable the rollback feature on Commons'
14:19 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '18307 Add autopatrolled group to English Wikisource'
14:12 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Bug 17717 Enable subpages on main namespace of UK chapter website'
13:55 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Bug 18428 cswikisource settings updates'
12:38 Tim: restarting copy to ms3
12:25 Tim: rebooting ms3 with 2.6.28 kernel
12:18 Tim: running xfs_check on ms3
12:14 Tim: restarting ms2 with domas's 2.6.28 kernel
12:06 logmsgbot: midom synchronized php-1.5/db.php 'removing db25 - apparently it was down for more than a day'
11:58 domas: db25 went down, resetting
11:08 Tim: ms3 went down, no response on serial console, rebooting
11:05 logmsgbot: tstarling synchronized php-1.5/db.php
08:32 Tim: copy in progress, rsync over ssh controlled via screen on tstarling@zwinger
08:23 Tim: shutting down mysqld on srv98,srv122,srv124,srv142,srv102,srv106,srv107 for data directory copy to ms3
April 14
23:48 logmsgbot: tfinc synchronized php-1.5/extensions/ContributionTracking/ContributionTracking_body.php
23:39 logmsgbot: tfinc synchronized php-1.5/extensions/ContributionTracking/ContributionTracking_body.php
23:37 logmsgbot: tfinc synchronized php-1.5/reporting-setup.php
19:01 Rob: replaced dead drive in ms4
18:41 Rob: srv78 back online
18:37 Rob: srv78 was wonky and such, reinstalled to fix.
18:21 Rob: srv90 reinstalled and redeployed
18:21 Rob: memcached had stopped on srv89, restarted.
18:16 Rob: all fans are good on srv86, bringing back online.
18:13 Rob: srv86 has temp warnings, shutting down to check fans and such
17:59 Rob: reinstalling srv90 from FC to ubuntu
17:52 Rob: replaced bad fan in srv90
17:40 Rob: pulling srv90, overheating warnings.
17:38 Rob: srv85 overheating due to dead fans. server is old and out of warranty, decommissioned but kept on site for parting out.
17:10 Rob: bringing back up sq1, no memory on hand for upgrading these (They are ddr pc3200, all the spare memory we have is ddr2 or ddr 2700)
17:02 Rob: pulling sq1 for memory upgrade.
16:55 Rob: replaced bad patch for search9, LOM functions properly.
16:38 Rob: Upgraded memory in search1 and search2 to a total of 16GB each (previously 8).
16:14 Rob: Had to restart wikitech due to OOM issues, again. Perhaps it is time to up the memory in the machine or tweak settings.
05:05 Andrew: testwiki problem seems to be a squid problem, can get srv123, srv84 to serve the main page with no problems by sending a request through netcat. Trying to connect to rr just gets no response
05:01 Andrew: testwiki seems fubar, timing out on all pageviews.
03:47 logmsgbot: andrew synchronized php-1.5/InitialiseSettings.php
03:45 Andrew: Installing AbuseFilter on alswiki
April 12
15:22 logmsgbot: tstarling synchronized php-1.5/extensions/SecurePoll/SecurePoll.i18n.php
12:25 Tim: updated CentralNotice templates manually to get the license vote in the header
09:05 Tim: loading license update messages into SecurePoll jump side with sp-msgs-reduced.sql
April 11
12:05 domas: restarting all ubuntu memcacheds, rolling 1.2.8-4 live
08:31 domas: rebooted srv187 with all the new kernels and such
April 10
08:51 domas: few memcacheds were hitting OOMs, I really have to upgrade them :)
April 9
20:15 JeLuF: started "maintenance/importImages.php" upload of the second batch of Fotothek images to commons
16:37 logmsgbot: tstarling synchronized php-1.5/extensions/SecurePoll/includes/Auth.php
14:43 Tim: php_admin_flag engine on in the SecurePoll directory
14:37 logmsgbot: tstarling synchronized php-1.5/extensions/SecurePoll/includes/VotePage.php
13:55 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'enabling subpages on fiwikiversity'
12:59 logmsgbot: tstarling synchronized php-1.5/CommonSettings.php 're-enabled SecurePoll'
12:59 logmsgbot: tstarling synchronized php-1.5/extensions/SecurePoll/includes/Entity.php
12:33 logmsgbot: tstarling synchronized php-1.5/CommonSettings.php 'enabled SecurePoll'
12:32 logmsgbot: tstarling synchronized php-1.5/CommonSettings.php 'enabled SecurePoll'
09:04 mark: Restored Amsterdam traffic; problem resolved
08:07 mark: Moving Amsterdam traffic to pmtpa while a power problem at esams is being investigated
05:27 JeLuF: many hosts in esams not reachable any more. Switch outage?
April 8
21:31 Tim: running voterList.php for all wikis on hume, to construct license update voter lists
21:05 Rob: pushed blog.wikimedia.org dns back into the squid cluster
12:13 Tim: installing SecurePoll on WM including IP.php r49117
04:33 logmsgbot: andrew synchronized php-1.5/mc-pmtpa.php 'Comment for DOWN nodes that seem to be up'
04:28 Andrew: Only two spare memcached nodes left. Checked all the nodes marked as down, and found that srv126, srv100, srv137, srv92, srv129 seem to be up (tried nc to port 11000, and got ERROR response). Not moving them into the SPARE section in case I'm not doing it right.
04:19 logmsgbot: andrew synchronized php-1.5/mc-pmtpa.php 'Memcached on srv143 died, replaced with srv197 (slot 12)'
00:34 logmsgbot: andrew synchronized php-1.5/InitialiseSettings.php
00:33 Andrew: deploying AbuseFilter to zhwiki
April 7
18:34 Rob: restarted memcached on srv116
18:26 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Bug 18307 Autopatrolled permission on enwikisource.'
16:36 Rob: removed the outdated fundraising blog from planet
16:27 Rob: singer is back to normal. whygive is fubar, but no one cares ;] The rest of the services are online and functional.
08:28 domas: reset srv217 - where did I hear that again. it was hanging on image NFS and had segfaulting apache too. 7 incidents with it in past two months - needs hardware diagnostics
01:48 Rob: blog.wikimedia.org is now up. singer is kinda mostly fixed, i will finish it in the morning. all sites on it are up.
00:16 Rob: pushed blog.wikimedia.org out of squid via dns
April 6
21:49 Rob: singer is nearly back to normal, however the primary corporate blog is doing funny things with caching and apache direction on the server.
21:34 Rob: random insanity with apache on singer, affected corporate blogs, ocs, wm09scholarships, communicate portal, and ...well... thats enough. Still working on resolution.
14:26 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Bug 16178 Activate Collection Extension for generating PDF on the French Wikiversity'
14:18 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Bug 17861 logo update at german wikibooks'
13:52 logmsgbot: kate synchronized php-1.5/extensions/CodeReview/CodeReview.php
12:44 domas: reset power for srv187
12:44 domas: restarted hanging apaches on srv90, srv97 and crashlooping ones srv189, srv206
12:35 domas: restarted busylooping memcached on srv143, new bug!
11:47 logmsgbot: kate synchronized php-1.5/extensions/CodeReview/codereview.css
04:57 logmsgbot: andrew synchronized php-1.5/mc-pmtpa.php 'srv129 down, swapped it for srv182'
04:54 logmsgbot: andrew synchronized php-1.5/includes/specials/SpecialBlockip.php 'Live-merging r49222 -- fix for hiding of logging data in recentchanges. Tests okay on testwiki.'
April 5
19:45 logmsgbot: jeluf synchronized php-1.5/mc-pmtpa.php 'replace srv92 by srv152'
15:07 domas: Tim rocks, cluster back to normal
15:04 Tim: deployed r49212 to fix infinite template recursion issue
11:24 domas: restarted some fedora apaches, were stuck in write() after the previous hiccups
10:28 mark: Restarted memcached on srv151
08:59 domas: I hate computers
08:28 domas: hanging latex processes held :80, thus not allowing clean apache restarts on some nodes
08:18 domas: restarted plenty of crashlooping apaches, investigating resource consumption problem
05:53 Andrew: srv183 back up and memcached running, moved it from DOWN to SPARE. Threads on db12 back below 1000 again (normal range). Somehow I just resolved my first site problems by myself :O
05:44 logmsgbot: andrew synchronized php-1.5/mc-pmtpa.php
05:42 Andrew: srv183 went down, and it's running memcached. Replaced it with srv61, the first one in the SPARE section of mc-pmtpa.php, and moved it out. Hoping I did this right, and that it helps with db12 overload.
05:30 Andrew: db12 (enwiki master) overloaded. Nothing I can do about it.
01:52 logmsgbot: andrew synchronized php-1.5/includes/specials/SpecialBlockip.php 'Live-merge r49191, fixes a bug that gets in the way of suppression'
April 4
21:01 azafred: bounced apache on srv82
20:14 azafred: bounced apache on srv50
18:16 azafred: bounced apache on srv72
18:15 azafred: bounced apache on srv115
18:01 azafred: bounced apache on srv121
17:55 azafred: bounced apache on srv71
14:32 azafred: bounced apache on srv147
14:25 domas: restarted memcacheds on srv67, srv112 and srv143 - they seem to have hit reference leak condition (that was probably resolved in memcached 1.2.7)
13:44 brion_: deployed update to wikibugs
13:03 Tim: restarting trackBlobs.php, probably died during db2 crash
09:17 domas: reset-mysql-slave on db23, purged 1-100 logs on db13
05:06 azafred: bounced apache on srv89
04:46 azafred: bounced apache on srv124
April 3
20:36 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Turned on collections on srwiki, srwikibooks, and srwikisource'
09:16 mark: Power cycled sq34
00:15 Andrew: installed renameuser, updated mediawiki to r48811 on usability
April 1
23:08 Fred: restarted apache on srv99
22:57 mark: Restored session to AS 30217 as well
22:33 mark: Brought session to AS 13680 back up
21:30 mark: Shut down BGP sessions to AS 13680 and 30217 for what appears to be problems to/within Level 3 Tampa
16:10 JeLuF: Image import of about 5000 images done. 245000 left to do...
15:53 Fred: rebooting srv217 since it is wedged.
15:47 Fred: restarted apache on srv137
13:47 logmsgbot: andrew synchronized php-1.5/InitialiseSettings.php 'Activating AbuseFilter on tpiwiki and hewiki, bugs 18299, 18300'
13:45 mark: Rebooting srv217
13:45 logmsgbot: andrew synchronized php-1.5/CommonSettings.php 'AbuseFilter custom settings, hewiki'
13:40 JeLuF: batch importing images from the Deutsche Fotothek, commons.wikimedia.org/wiki/Commons:Deutsche_Fotothek
13:11 Andrew: Reports that Common.css/Common.js weren't working on hsbwiki. Manually purging on the command-line fixed the issue.
08:42 logmsgbot: andrew synchronized php-1.5/InitialiseSettings.php 'Similar fix for plwiki wgRemoveGroups -- added "bot"'
08:41 logmsgbot: andrew synchronized php-1.5/InitialiseSettings.php 'Fix for plwiki wgAddGroups, overriding with array( "abusefilter" ) stopped plwiki bureaucrats from adding other groups'
04:06 river: test x
04:05 Andrew: Works again
04:04 Andrew: Testing re-enabling of identi.ca bridge for morebots
March 31
21:19 aZaFred: restarted apache on srv99
15:19 logmsgbot: andrew synchronized php-1.5/extensions/AbuseFilter/Views/AbuseFilterViewExamine.php 'Plug up the ability of users to run arbitrary filters against edits. Not strictly a security risk, but you could do some nasty things to slow down the servers with a filter (DoS vector).'
15:18 logmsgbot: andrew synchronized php-1.5/extensions/AbuseFilter/Views/AbuseFilterViewTestBatch.php 'Plug up the ability of users to run arbitrary filters against edits. Not strictly a security risk, but you could do some nasty things to slow down the servers with a filter (DoS vector).'
15:18 logmsgbot: andrew synchronized php-1.5/extensions/AbuseFilter/AbuseFilter.class.php 'Plug up the ability of users to run arbitrary filters against edits. Not strictly a security risk, but you could do some nasty things to slow down the servers with a filter (DoS vector).'
15:17 logmsgbot: andrew synchronized php-1.5/extensions/AbuseFilter/AbuseFilter.i18n.php 'Plug up the ability of users to run arbitrary filters against edits. Not strictly a security risk, but you could do some nasty things to slow down the servers with a filter (DoS vector).'
05:03 JeLuF: removed 30 GB of binlogs on db17
March 30
23:50 domas: interrupted crash recovery for db2, will do crime scene investigation afterwards :)
23:37 logmsgbot: midom synchronized php-1.5/CommonSettings.php 'enwiki rw with new master'
23:36 logmsgbot: midom synchronized php-1.5/db.php 'welcome new enwiki master, db12 - db12-bin.016 79'
23:27 domas: too many open transactions (?) on enwiki/db2 caused it to go OOM or so...
23:25 logmsgbot: midom synchronized php-1.5/CommonSettings.php
23:23 river: Out of Memory: Killed process 5374 (mysqld).
23:21 Andrew: mysqld on db2 crashed.
15:58 aZaFred: restarted Apache on srv126 and srv103
08:22 domas: memcached on srv60 segfaulted..
05:55 logmsgbot: andrew synchronized php-1.5/includes/specials/SpecialUserrights.php 'Live-merging r48993, fix for global group membership form (regression)'
05:55 logmsgbot: andrew synchronized php-1.5/extensions/CentralAuth/SpecialGlobalGroupMembership.php 'Live-merging r48993, fix for global group membership form (regression)'
March 29
14:20 logmsgbot: midom synchronized php-1.5/includes/specials/SpecialWatchlist.php 'sync up watchlist fixes, r49002'
11:20 logmsgbot: midom synchronized php-1.5/includes/specials/SpecialRecentchanges.php 'temporary increasing internal RC limit to 5000'
11:15 logmsgbot: midom synchronized php-1.5/includes/specials/SpecialRecentchanges.php 'some more efficient joining'
11:09 logmsgbot: midom synchronized php-1.5/includes/ChangesList.php 'livemerging up to 48990'
11:09 logmsgbot: midom synchronized php-1.5/includes/specials/SpecialRecentchanges.php 'livemerging up to 48990'
10:35 logmsgbot: andrew synchronized php-1.5/CommonSettings.php 'Bug 18223, let sysops edit abuse filters on dewiki'
08:51 logmsgbot: midom synchronized php-1.5/CommonSettings.php 'move away the mainpage delete protection to getUserPermissionsErrorsExpensive'
08:41 logmsgbot: midom synchronized php-1.5/includes/Title.php 'merging in c48983'
07:37 logmsgbot: andrew synchronized php-1.5/InitialiseSettings.php 'Turning on AbuseFilter on enwikiquote'
March 28
00:56 Tim: fixed permissions on .ssh directories on singer. Converted jforrester's authorized_keys file from RFC to OpenSSH format.
00:08 logmsgbot: andrew synchronized php-1.5/includes/specials/SpecialUserrights.php 'Syncing revert of apiuserrights r48909'
00:07 logmsgbot: andrew synchronized php-1.5/includes/User.php 'Syncing revert of apiuserrights r48909'
00:06 logmsgbot: andrew synchronized php-1.5/includes/AutoLoader.php 'Syncing revert of apiuserrights r48909'
00:05 logmsgbot: andrew synchronized php-1.5/includes/api/ApiQueryUsers.php 'Syncing revert of apiuserrights r48909'
00:05 logmsgbot: andrew synchronized php-1.5/includes/api/ApiQueryRecentChanges.php 'Syncing revert of apiuserrights r48909'
00:04 logmsgbot: andrew synchronized php-1.5/includes/api/ApiMain.php 'Syncing revert of apiuserrights r48909'
00:03 logmsgbot: andrew synchronized php-1.5/extensions/CentralAuth/SpecialGlobalGroupMembership.php 'Reverting apiuserrights (r48910)'
March 27
23:29 aZaFred: rebooting srv217 since it is wedged.
22:10 aZaFred: restarted apache on srv203
21:39 logmsgbot: brion synchronized php-1.5/InitialiseSettings.php 'disabling TIFF->JPEG thumbnailing, doesn't work at present with our setup'
21:38 logmsgbot: brion synchronized php-1.5/InitialiseSettings.php 'Enabling TIFF->JPEG thumbnailing experimentally'
17:49 brion: enabled upload-by-url for all users on testwiki for wider testing, upped $wgMaxUploadSize to 500MB from default 100
17:44 logmsgbot: brion synchronized php-1.5/includes/specials/SpecialUpload.php
17:43 logmsgbot: brion synchronized php-1.5/includes/DefaultSettings.php
17:43 brion: live-merging r48923 to make CURL timeout for upload-by-url configurable
17:27 brion: enabled xml-rpc publishing on techblog so I can administer from WordPress iPhone app
10:56 mark: Moved server nehalem to vlan 101
02:28 Tim: db29 was full of relay logs, ran RESET SLAVE.
March 26
22:45 logmsgbot: brion synchronized php-1.5/includes/specials/SpecialBlockip.php 'to r48899 - fixes for hiding'
22:42 Danny_B: wikibugs-l stopped sending mail to the wikibugs-irc mailbox due to excessive bounces. Re-enabling sending.
22:05 logmsgbot: midom synchronized php-1.5/db.php 'db1 coming back as frwiki/jawiki slave'
21:49 logmsgbot: brion synchronized php-1.5/extensions/ProofreadPage/ProofreadPage.php to r48899
18:36 logmsgbot: brion synchronized php-1.5/languages/LanguageConverter.php
18:36 brion: applying r48836 language converter fix live
18:22 aZaFred: srv224 and srv225 have been kickstarted, deployed and put in rotation.
15:56 Rob: srv224 and srv225 have temp power from A4 until new cables can be made.
15:56 Rob: morebots was dead!
12:45 domas: 10s lock wait timeout doesn't work for parallel data loads %)
09:45 logmsgbot: midom synchronized php-1.5/db.php '*whip* db18, back to work, grunt.'
09:16 logmsgbot: midom synchronized php-1.5/db.php 'letting db12 back into the pool'
08:36 domas: doing firmware/kernel updates on db12
08:36 logmsgbot: midom synchronized php-1.5/db.php
08:32 domas: updating firmware on db18 ILOM: load -source
08:29 domas: rebooting db18 with 2.6.28.2
08:27 domas: db18 problem was 2.6.24, not 'memory use', I guess
03:29 logmsgbot: andrew synchronized php-1.5/extensions/CentralAuth/SpecialCentralAuth.php 'Fixes for strange display on Special:CentralAuth'
03:28 logmsgbot: andrew synchronized php-1.5/extensions/CentralAuth/CentralAuth.i18n.php 'Fixes for strange display on Special:CentralAuth'
02:43 logmsgbot: andrew synchronized php-1.5/extensions/AbuseFilter/AbuseFilter.class.php 'Deploying contains_any function, radix regex fixes for performance improvements'
02:43 logmsgbot: andrew synchronized php-1.5/extensions/AbuseFilter/AbuseFilter.parser.php 'Deploying contains_any function, radix regex fixes for performance improvements'
02:42 logmsgbot: andrew synchronized php-1.5/extensions/AbuseFilter/AbuseFilter.i18n.php 'Deploying contains_any function, radix regex fixes for performance improvements'
00:11 Tim: killed flaggedrevs update on zwinger, same reason as last time I did it. This time it actually took down ganglia for a while and made shell access very slow.
March 25
23:03 aZaFred: kickstarted spence to run some test on.
19:09 mark: Allocated port 0/1/13 on asw-a4-sdtpa for the server with the wrong name
19:04 Rob: updated dns for nahalem test server
15:59 logmsgbot: andrew synchronized php-1.5/extensions/FlaggedRevs/specialpages/OldReviewedPages_body.php 'Bug in OldReviewedPages'
15:41 Rob: running the flaggedrevs updatelinks script across the flaggedrevs wiki in a screen session on zwinger
15:30 Andrew: Running php svnImport.php MediaWiki 0 --wiki mediawikikwiki
15:28 logmsgbot: andrew synchronized php-1.5/extensions/CodeReview/CodeRevision.php 'r48831'
14:56 logmsgbot: andrew synchronized php-1.5/includes/specials/SpecialBlockip.php
14:44 logmsgbot: andrew synchronized php-1.5/extensions/Collection/Collection.body.php
14:43 logmsgbot: andrew synchronized php-1.5/extensions/ProofreadPage/ProofreadPage.php
14:11 logmsgbot: andrew synchronized php-1.5/extensions/ProofreadPage/ProofreadPage.php
14:08 logmsgbot: andrew synchronized php-1.5/extensions/ProofreadPage/ProofreadPage.php
14:01 logmsgbot: andrew synchronized php-1.5/includes/api/ApiQueryImageInfo.php
14:01 logmsgbot: andrew synchronized php-1.5/extensions/ProofreadPage/ProofreadPage.php 'Yet another fatal'
13:54 logmsgbot: andrew synchronized php-1.5/extensions/ProofreadPage/ProofreadPage.php 'Fatal errors'
13:49 logmsgbot: andrew synchronized php-1.5/includes/api/ApiQueryCategories.php 'Fatal for invalid titles'
13:49 logmsgbot: andrew synchronized php-1.5/includes/specials/SpecialUserrights.php 'Fatal for cross-wiki user rights'
13:47 logmsgbot: andrew synchronized php-1.5/extensions/FlaggedRevs/FlaggedRevs.hooks.php 'Returning a value from a hook, causing exceptions.'
13:41 logmsgbot: andrew synchronized php-1.5/extensions/ProofreadPage/ProofreadPage.php 'Fixing a fatal'
13:37 brion: ProofreadPage is borked
13:34 brion: scap complete!
13:23 brion: starting general scap to r48811 -- yay!
13:16 Andrew: Reverted some live hacks for AbuseFilter that were in there because of dependencies on core.
12:48 brion: svn up'd test to r48811 ... last one?
12:31 brion: svn up'ing test to r48810
12:22 brion: disabling Configure extension on testwiki for now; we'll poke at it more later
12:22 logmsgbot: brion synchronized php-1.5/InitialiseSettings.php
12:20 brion: applying DB schema tweaks for flaggedrevs, codereview
12:16 logmsgbot: andrew synchronized php-1.5/extensions/AbuseFilter/AbuseFilter.parser.php 'Trying to get rid of a warning'
11:55 brion: scap time cometh! svn up'ing testwiki for shakedown...
10:26 brion: svn up'ing CodeReview
01:29 Tim: running trackBlobs.php again on hume
01:23 Tim: db18 back up, replicating, 14285s lag
01:18 logmsgbot: tstarling synchronized php-1.5/db.php
01:14 Tim: rebooted ILOM on db18, was refusing to reboot the machine
00:58 Tim: db18 was locked up in kswapd, attempting reboot
00:30 Tim: installed NRPE etc. on ms2 and ms3
00:15 logmsgbot: tstarling synchronized php-1.5/db.php
00:11 logmsgbot: tstarling synchronized php-1.5/db.php
00:09 logmsgbot: tstarling synchronized php-1.5/db.php
March 24
23:57 Tim: changing master for rc1 to ms3. Omitting srv183, which will be removed from the group.
23:46 azafred_: updated motd on srv32 to reflect its puppeteer status.
23:35 domas: experimented with PG vs MySQL5.0 performance on db28 :)
23:34 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Ack, added to foundationwiki, not fishbowl, reverted, and added to fishbowl.'
23:07 Rob: lowered the memory limit on usability project server
22:52 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Adding an inactive group to foundationwiki'
22:43 brion: usability.wikimedia.org timing out
22:23 azafred_: fixed spamassassin rules compilation on mchenry to speed up the process
22:22 azafred_: Updated spamassassin rules to include more spam definitions on mchenry.
21:36 brion: updating usability.wikimedia.org to current, installing AntiSpoof, AbuseFilter to help clean up vandalism problems
18:59 azafred_: bounced apache on srv201
18:58 azafred_: bounced apache on srv188
18:30 Rob: pushing apache changes for wikizdroje.cz for cs.wikisource.org
17:40 azafred_: restarted apache on srv190
15:30 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Bug 18023 Enable Collection extension on svwiki'
15:22 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '//For bug 18061 Remove obsolete settings from cluster config files'
10:45 logmsgbot: midom synchronized php-1.5/includes/specials/SpecialRecentchanges.php 'RC cost became too high to support 5000 limit, decreased to 500'
06:56 Tim: stopped mysql on srv171 and srv183 for copy to ms2 and ms3. Depooled.
06:55 logmsgbot: tstarling synchronized php-1.5/db.php
06:53 Tim: installed ganglia on ms2 and ms3, put them in the mysql cluster
04:31 logmsgbot: andrew synchronized php-1.5/InitialiseSettings.php 'I broke plwiki'
04:27 logmsgbot: andrew synchronized php-1.5/InitialiseSettings.php 'Bugs 18073, 18094, 18102, AbuseFilter for commons, plwiki, svwiki'
04:22 logmsgbot: andrew synchronized php-1.5/CommonSettings.php 'Custom AbuseFilter settings for plwiki (bug 18073)'
01:41 mark: Installed Ubuntu on ms2; ms3 and ms2 are now ready for ES usage
March 23
23:32 logmsgbot: tstarling synchronized php-1.5/InitialiseSettings.php 'rate limit exemption for usability testing'
22:56 domas: db1 will serve fr/ja soonish
22:56 logmsgbot: midom synchronized php-1.5/db.php 'removing db1'
18:58 Rob: upgraded spam plugin on wikimedia blog (shows old stuff in dashboard due to caching, its updated now)
18:23 aZaFred: Setup access on LDAP and NIS for fvassard.
13:36 logmsgbot: midom synchronized php-1.5/db.php 'rename s2a into s2dewiki, add s2commons with single primary server ixia, enable ixia with commons-only dataset, *poof*'
00:34 brion-weekend: customized style a bit on techblog.wikimedia.org
March 22
23:14 Tim: killed long-running tidy instance on srv108
22:10 Tim: restarted memcached on srv82
12:15 domas: dumping commonswiki from db30 with server still pooled in and 4 dumper threads
10:38 domas: hey, I found remote code execution vulnerability, it seems! :)
10:37 logmsgbot: andrew synchronized php-1.5/extensions/AbuseFilter/AbuseFilter.parser.php '--message=Arbitrary execution vulnerability in AbuseFilter, exploitable only by admins'
06:00 logmsgbot: tstarling synchronized php-1.5/db.php 'depooled ixia, is full'
05:54 Tim: db5 was running out of disk space due to excessive relay logs. Ran RESET SLAVE.
03:02 logmsgbot: andrew synchronized php-1.5/extensions/AbuseFilter/Views/AbuseFilterViewExamine.php '18096 Special:AbuseFilter/examine doesn't list new account creation log entries'
02:47 logmsgbot: andrew synchronized php-1.5/extensions/AbuseFilter/AbuseFilter.class.php 'Fixed bug in batch testing interface'
02:41 logmsgbot: andrew synchronized php-1.5/extensions/AbuseFilter/AbuseFilter.parser.php 'Optimisation of rmdoubles, causes 20-fold performance improvement on large pages'
02:23 logmsgbot: andrew synchronized php-1.5/extensions/AbuseFilter/Views/AbuseFilterViewDiff.php 'Prevent leaking of private filters through diff interface'
March 21
17:30 logmsgbot: midom synchronized php-1.5/db.php 'someone forgot to enable db8.. putting back to pool'
17:29 logmsgbot: midom synchronized php-1.5/db.php 'bringing back db26'
10:46 logmsgbot: midom synchronized php-1.5/db.php 'taking out db26 for kernel experiments (2.6.28.8 with some different build options)'
00:12 brion: upload-by-URL enabled for sysops on testwiki (using khaldun as internal proxy)
00:07 logmsgbot: brion synchronized php-1.5/InitialiseSettings.php
00:04 logmsgbot: brion synchronized php-1.5/extensions/MWSearch/MWSearch_body.php
00:04 logmsgbot: brion synchronized php-1.5/includes/specials/SpecialUpload.php
00:03 brion: live-merging r48648 to allow $wgHTTPProxy to work for uploads and not interfere with search
March 20
23:58 Tim: repooled srv126, part of cluster12, appears to be up and working
23:58 logmsgbot: tstarling synchronized php-1.5/db.php
23:54 Tim: depooled srv125 from ES, has been down for 12 days. cluster12 is now down to 1 server
23:54 logmsgbot: tstarling synchronized php-1.5/db.php
23:52 Tim: install ganglia-metrics on db24
23:13 Tim: set up logrotate on locke, using the same script that we use for MW debug logs
22:56 domas: kickbanned security threat from #wikimedia-tech, was trying to install keylogger and steal our passwords
22:54 Tim: removed old squid log stream going to iris. Set up a log stream going from all the squids to locke.
22:37 Tim: depooled adler, it's down
22:36 logmsgbot: tstarling synchronized php-1.5/db.php 'depooled adler'
21:01 mark: removed vlan 5 on csw5-pmtpa that was accidentally created/left behind by Tim
20:47 domas: installed snaprotate on db26, enabled snapshots with 2x8h schedule, updated Database snapshots
20:44 logmsgbot: midom synchronized php-1.5/includes/api/ApiQueryRevisions.php 'livemerging r48642'
20:43 logmsgbot: midom synchronized php-1.5/includes/filerepo/ArchivedFile.php 'livemerging r48644'
20:27 domas: someone who wrote ArchivedFile::load, needs some pain and torture applied (query doesn't use index... ;-)
20:25 logmsgbot: midom synchronized php-1.5/db.php 'reduced load for db26 from 200 to 100, as it has reduced amount of RAM and increased amount of other work'
20:09 domas: added db26 to s1 pool
20:08 logmsgbot: midom synchronized php-1.5/db.php
18:47 mark: JeLuF> |log \o/
18:46 domas: pooled db22 back in, as it caught up on replication!
18:46 mark: raised karma of /h/w/b/reset-mysql-slave
18:46 logmsgbot: midom synchronized php-1.5/db.php
18:45 domas: ran /h/w/b/reset-mysql-slave on db26, after copy from db22!
18:45 domas: wrote /h/w/b/reset-mysql-slave to reset mysql slaves!
18:32 logmsgbot: jeluf synchronized php-1.5/InitialiseSettings.php '15880 Pseudo-Namespace on Korean Wikipedia'
18:25 logmsgbot: jeluf synchronized php-1.5/InitialiseSettings.php '17699 Create Appendix namespace on Spanish Wiktionary'
18:19 logmsgbot: jeluf synchronized php-1.5/InitialiseSettings.php '17987 Set $wgBlockAllowsUTEdit = true for zh.wikipedia'
18:16 logmsgbot: jeluf synchronized php-1.5/InitialiseSettings.php '16446 tlwikibooks namespaces to be searched by default'
18:14 logmsgbot: jeluf synchronized php-1.5/InitialiseSettings.php '16446 Add Pagluluto: namespace to tlwikibooks'
18:12 domas: db26 bootstrap problem was inconsistent ibdata specification, probably first copy would've been enough :)
17:55 logmsgbot: jeluf synchronized php-1.5/InitialiseSettings.php '17818 Create "Wikijunior" namespace on Polish Wikibooks (fix)'
17:53 logmsgbot: jeluf synchronized php-1.5/InitialiseSettings.php '17818 Create "Wikijunior" namespace on Polish Wikibooks'
17:41 domas: I'm out of luck, copy from db22 to db26 has failed again
17:32 JeLuF: changed sync-common-file to automatically log messages to Server admin log
17:27 logmsgbot: jeluf synchronized php-1.5/InitialiseSettings.php "17739 nowiktionary logo"
17:18 JeLuF: changed slwikibooks, slwikisource logos to $stdlogo
17:17 domas: tried to poke Werdna about AbuseFilter regexp warnings, but he didn't listen (check remotelogtail @ db20 )
17:16 domas: restarted srv170 - was throwing occasional segfaults
17:15 domas: 'RESET SLAVE' on db22, to clean the relay log congestion
17:14 JeLuF: changed mgwiki logo to $stdlogo
17:13 domas: attempted to reboot adler using sysrq, failed at it. adler needs datacenter service
17:12 domas: did set up Jens access rights on wikitech, because nobody did that before
17:11 JeLuF: sahwiki and stqwiki logos changed to $stdlogo
17:11 domas: tested block compression on 'pagelinks' table on lomaria's innodb/1.0.3
17:11 domas: tested key packing on 'pagelinks' table on lomaria's innodb/1.0.3
17:10 domas: erased test mysql 5.1 builds on db26
17:10 domas: restarted data copy from db22 to db26
17:10 domas: forgot to stop mysqld on db26 :)
17:09 domas: started copying data from db22 to db26
17:09 domas: took out db22 for copy to db26
16:53 JeLuF: locked rn.wiktionary
12:48 domas: truncated log/remote on db20, had 6G of adler kernel noise, firewalled out adler syslog stream
06:10 Tim: started trackBlobs.php for all fedora clusters
03:20 Tim: rebuilt udplog for hardy and installed it on locke
02:55 Tim: installed ganglia and NRPE on locke
02:48 Tim: renamed db6 to locke in csw5 port list, this wiki, dsh node lists
01:28 Tim: reinstalled db6 as "locke", with Ubuntu 8.04, RAID 5
00:22 Tim: removed db8 from groupLoadsBySection, was causing it to be included in lag reporting
March 19
23:04 Tim: prepared DNS for a rename of db6 to "locke", for use as a squid log server, with a new IP address on the public subnet
22:47 river: Morebots 0 - 1 Moarbots
22:46 Andrew: Removed identica code from morebots for now, it's giving annoying error messages.
22:28 mark: Removed sq1 from upload squids node group
22:11 Tim: moving updateLinks.php job from zwinger to hume
21:58 Tim: restarted apache on srv62, srv82, srv151
21:53 Tim: hume is down, OOM, rebooting. Was down since about 18:00.
20:46 brion: poke poke
20:42 Rob: ran extensions/FlaggedRevs/archives/patch-fpc_level.sql on existing flaggedrevs dbs (did it about 15 minutes ago)
20:41 Rob: wikitech died, rebooted.
20:06 Rob: running updatelinks php script on flaggedrevs dbs per aaron
19:58 Rob: enabled pdf collection on meta
19:38 Rob: updated initialisesettings for 16342 Enable flood flag configuration on English Wikibooks
18:59 Rob: updated initialisesettings for bug 17986 Set $wgBlockAllowsUTEdit = true for zh.wikibooks
17:18 Rob: updated Initialisesettings for 14079 Configure groups in nowiki for access to Special:Unwatchedpages
17:12 Rob: added namespace aliases to zhwiki and ran dupe checking per 17885 Aliases of 'Wikipedia talk' namespace in Chinese Wikipedia
17:00 Rob: 16426 Enable subpages in template namespace on MediaWiki.org is done.
16:53 Rob: updated the logo in sul login for wikibooks
15:24 Rob: updated CommonSettings.php per bug 17453
15:16 Rob: ran namespacedupes script for zhwiki per bug 17701
15:12 Rob: added new namespaces to nowikisource per bug 16232
15:04 Rob: added 6 new namespaces to zhwikisource per bug 15722
14:27 Rob: updated flaggedrevs onto iawiki per
14:11 Rob: ran flaggedrevs autopromote script on iswiktionary
14:08 Rob: updated for 16476 Enable FlaggedRevs Patrolling Configuration on is.wiktionary
14:07 Rob: updated for 16427 Set $wgRestrictDisplayTitle to False on Chinese Wikipedia
10:51 domas: disabled write barriers on lomaria's /a
08:55 river: undepooled adler, depooled db8
06:21 Andrew: synced r48573, bug in testing interface
02:53 Andrew: Synced r48564 to sites to allow more self-policing of filter performance, displaying run time in ms on the filter page itself.
02:44 Andrew: Updated AbuseFilter to r48564 on test to check filter profiling.
01:27 Andrew: Giving abusefilter-revert right to enwiki admins
00:37 Andrew: Synchronised AbuseFilter.parser.php for short-circuiting
00:22 Andrew: Updating AbuseFilter on test to r48553
March 18
20:44 brion: also req'ing doc updates for Upload filesystem snapshots
20:43 brion: made a quick note about our database snapshots, needs more docs
20:21 Tim: deploying AjaxResponse.php r48531
20:20 brion: disabling image moving due to reports of breakage
20:18 brion: synced r48525 for temp xss fix in abusefilter ajax
19:45 Tim: re-enabled AbuseFilter with per-filter profiling
19:42 brion: disabling AbuseFilter on en.wikipedia.org; performance problems on save. Needs proper per-filter profiling for further investigation.
05:49 Andrew: Live-hacked in r48512 on AbuseFilter -- visual diffs on details page.
04:14 Andrew: Synced r48509 in AbuseFilter -- cross-filter diffing allowing leaking of hidden filters.
March 17
23:32 Andrew: Activated AbuseFilter on enwiki
23:18 Andrew: Scapping to update AbuseFilter
23:12 Andrew: Updating AbuseFilter to r48500 on testwiki.
22:48 Tim: NRPE installed on srv100
22:37 Tim: installed NRPE on db17, adler, thistle, lomaria, db30. Fixed NRPE on thistle.
22:29 Tim: deleted cluster13 and cluster14 backups on storage2
14:19 Rob: updated logo for pntwiki per bug 17960 Update Logo for pntwiki
03:27 Tim: deployed OggPlayer.js r48477
00:16 Andrew: Scapping to update AbuseFilter
March 16
23:30 Andrew: Conflict in AbuseFilter resolved. AbuseFilter on testwiki only updated to r48466. Will roll out to other wikis in an hour or so.
23:25 Andrew: Conflict in extensions/AbuseFilter/Views/AbuseFilterViewHistory.php -- DO NOT SCAP
23:09 brion: enabling $wgAllowImageMoving sitewide. Default group permissions allow image moving for sysops only, so should be safe-ish.
22:10 brion: setting up basic TIFF upload support on test & commons (bugzilla:17714) per req of image restoration folks. No thumbnailing yet.
20:36 brion: set up aboostani on SVN
18:43 Rob: updated akismet plugin on blog.wikimedia.org
17:22 Rob: and brion called me a weenie cuz I do not do enough SVN work.
17:21 Rob: updated en.planet with the new tech blog feed
15:48 Rob: forced ssl login and admin panel for techblog, rest moves back to standard http
15:45 Rob: setup https for techblog
08:07 Andrew: Bug 17998 Allow autoconfirmed users to see filters and logs on ruwiki
03:27 Andrew: Bug 17071 - Allow import rights to be added/removed by bureaucrats on mediawikiwiki
March 15
11:21 domas: lomaria runs dewiki on 5.1.33/innodb1.0.3
March 14
20:39 mark: db4 was being used for special page updates from hume and lagged, reduced its load from 150 to 50
13:10 mark: Reduced cache_mem on backend squid sq28 to see if memory pressure is causing some issues
07:03 Andrew: added /etc/init.d/morebots to wikitech, to auto-start morebots. also made it auto-restart on crash
March 13
23:11 Rob: migrated survey software from isidore to singer
22:21 Rob: restarted apache on singer after enabling all the mod rewrite stuffs
21:58 Rob: redeployed ./sync all for squid for whygive migration
21:58 Rob: disabled blog and whygive apache virtual hosts on isidore
21:58 Rob: migrated old whygive.wikimedia.org from isidore to singer.
21:21 Rob:
is online (although quite sad and empty)
21:14 Rob: setup tech blog on singer with database residing on db9
21:06 Rob: updated dns for new techblog (not yet live)
21:06 Rob: updated squids configuration for blog move
21:00 Rob: moved blog.wikimedia.org from old server isidore to new server singer
20:18 Tim: srv85 died, possible disk failure, no ssh or memcached, still has HTTP. Removed it from the memcached list, removed it from apache LVS.
20:15 Rob: updated spamfree plugin on blog.wikimedia.org
20:04 Rob: updated to newest version of wordpress on blog.wikimedia.org and whygive.wikimedia.org
17:24 Rob: updated flaggedrevs.php per bug 16365
17:23 Tim: removed password auth from nagios
15:28 river: ms4 disk c5t5d0 failed
March 12
18:08 Tim: restarted apache on srv75 and srv103
March 11
20:40 Rob: updated Initialisesettings for 17893 en.wiktionary bureaucrats can't add sysop, crat and bot flags (they also could not remove them until now)
00:37 Andrew: Live-hacking out r48087 (localisation changes for AbuseFilter with core dependencies) until a full code update.
00:13 Andrew: Activating AbuseFilter on arwiki
00:08 mark: Added httpdconf module to rsyncd on db20, and also restricted access to certain subnets
00:02 Andrew: scapping to update AbuseFilter
March 10
23:31 tomaszf: re-enabled central notice for all template generation
23:30 Andrew: on testwiki, that is
23:29 Andrew: updated AbuseFilter to r48294 for debugging
22:58 Andrew: updated AbuseFilter to r48288 on test.
22:32 Tim: restarted memcached on srv63
22:24 brion: updating Collection extension to r48283; fixes
18:45 logmsgbot: hi identi.ca folks! domas and mark just cleared #profiling data!
17:27 brion: reenabling limited centralnotice update cronjob on hume, since the live notice ain't working
00:52 brion: domas makes it all better. yay domas!
00:50 brion: cluster21 master (srv161) in read-only; we're having write failure problems on multiple wikis
00:48 brion: putting ES back into service on enwiki; srv160/cluster20 master has been fixed. slaves still running it, but this is safe for read/write since reads will fall back to master
March 9
20:59 Rob: fixed bug 17893 en.wiktionary bureaucrats can't add sysop, crat and bot flags
20:15ish-20:30ish Brion:
some (unspecified) blob tables in ES broken due to MAX_ROWS=1m. writes broken on enwiki. domas is rebuilding tables
disabling wgDefaultExternalStore on enwiki temporarily as hack measure. last text_id was 274347580 before
now actually disabling read-only too
09:50 Andrew: Live-synced r48211 -- fixes bug 17877, which is a security issue because it allows accounts to be created which need a bureaucrat to block.
09:15 domas: db24 ..
March 8
13:40 domas: srv125 didn't come up after sysrq-b
March 7
02:32 brion: updated wikibugs to r48113, fixes issue with resolved bugs
02:16 brion: installing build-essential on mchenry so CPAN will work
02:11 brion: CPAN sux
02:04 brion: installing CPAN Email::MIME on mchenry for new wikibugs...
01:59 brion: updated wikibugs to r48110, which actually uses an email parser (omg)
March 6
18:26 brion: restarting dump threads on srv31, down since it was rebooted a few days ago
06:37 Andrew: Live-hacking out the "save and share" functionality of Collections (UI and processing). It uses Article::doEdit and does absolutely no checking of permissions or filtering.
02:25 brion: reverting live wikibugs to older version, we need some further tweaking on the mail decoding :D
02:21 brion: updated wikibugs to r48080, which should handle unicode subjects even when the body is plain ascii
02:07 brion: updated wikibugs to r48079
02:05 brion: installing MIME::QuotedPrint perl module on mchenry to educate wikibugs about mail subject encodings
01:25 brion: adding 'hideuser' right to oversight group, forgot that one in january
00:58 brion: synced SpecialDeletedContributions for r47930 fix
March 5
22:53 Andrew: Ran namespaceDupes.php --fix --wiki oswiki for bug 17776
22:43 domas: edited /etc/udev/rules.d/70-persistent-net.rules on db28 to move back eth4 to eth0 :)
21:43 mark: Added community to 1299 session to prepend outbound announcements to 2828 once
21:42 Rob: updated DNS for bug 16955
21:25 Rob: ran php namespaceDupes.php --fix and it didnt seem to bork crap. Also checked the script itself. Seems ok and I did not break the entire site, so yay?
19:47 Rob: db28 mainboard swapped, booting up
19:28 mark: Set up NIS on zwinger
19:01 Rob: shutdown db26 to check its memory (lag, log, blah)
18:43 brion: removing obsolete, unused 'developer' group (bugzilla:12569)
16:49 Rob: updated InitialiseSettings for bug 17701 Alias of 'Wikipedia' namespace in Chinese Wikipedia
16:39 Rob: updated InitialiseSettings for bug 17307 Remove {{DISPLAYTITLE:}} restrictions for wm2009wiki
15:07 domas: db21 needs 2.6.28, hit the 2.6.24 deadlock problem
March 4
21:41 Rob: final errors in OCS configuration fixed. ocs.wikimania2009.wikimedia.org is now working properly
20:36 Rob: added singer to nagios
20:22 Rob: setup proper email sending for ocs software install
14:54 Rob: pushed changes to InitialiseSettings for bug 13055 Make newpage patrolling available to autoconfirmed users on jawp
14:43 Rob: pushed changes to InitialiseSettings for bug 16289 patrol function assigned to all users, not autoconfirmed users, on nlwiki
14:16 mark: Brought knsq7 back up
07:38 domas: srv217 was acting funny - lots of hanging processes which all decided to quit at strace. system locked up eventually, and hardware powercycle was done :)
07:37 domas: db24 hanged again (triple checked, not when producing a snapshot), nothing in dmesg or SP log
March 3
18:26 brion: setting our nameservers for transferred wikpedia.org domain
17:09 domas: re-enabled db5 and db25, db25 is serving as snaprotate slave for s3.
06:15 river: depooled db5 instead to dump from, left adler out of rotation
06:13 river: adler mysqld crashed during dump, i/o error: sd 2:2:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
06:02 river: restarted replication on db30
05:06 river: repooled db11 since it's the master and depooled db1 instead
05:04 river: depooled db11 to dump s3
March 2
18:57 brion: updating Collection ext to r47946; includes coll-license_url message for per-wiki license URL override
18:57 brion: morebots is dead :(
16:15 Rob: Updated and synced InitialiseSettings for bug 15927 Enable Special:Nuke on pl.wikipedia
03:56 river: removed db30 from rotation to dump commons
March 1
17:30 river: repooled db7
16:29 river: unpooled db7 to dump externallinks
09:54 Tim: Mirror update done. Doing apt-get update/upgrade and reboot test on srv188.
07:45 Andrew: Deploying AbuseFilter to ruwiki per bug 17729
07:05 Tim: deleting files from khaldun:/srv/ubuntu/pool that are over 2 years old
06:52 Tim: khaldun out of disk space as expected
05:51 Tim: switched the ubuntu mirror to use the one at neu.edu, which is much closer (43ms RTT) and thus faster for small files than the osuosl.org one
05:46 Tim: updated ApiQueryUsers.php to r47865
05:06 Tim: disabled mirror cron job on khaldun temporarily, while the update running in my terminal completes
05:00 Tim: khaldun had not been updating its ubuntu repositories since roughly September last year, due to the lack of a ~/.gnupg/trustedkeys.gpg file, for gpgv. Also the option syntax for debmirror had changed, breaking b/c, and feisty was removed from OSL. Fixed everything, removed feisty.
04:26 Tim: testing ubuntu mirroring on khaldun
February 28
03:19 river: put db7 back in rotation
February 27
22:37 Rob: setup daily backups of web directories from singer to tridge
22:25 Rob: setup https on singer for ocs software
21:37 Rob: singer installed and up, running wikimania2009 OCS software only at present. Will migrate other services to it at a later date/time.
21:03 river: stopped replication on db7 to dump s1
19:24 Rob: installing and setting up new r300 singer for OCS install and eventual blog migration.
12:21 river: mounted /mnt/upload5 on srv78
05:55 Tim: thistle was toast due to ChangeTags screwing up the query plan for RecentChanges for unfiltered queries. It was doing a scan of the tag_summary table followed by a filesort of recentchanges. Needs FORCE INDEX(ts_rc_id). Disabled with live patch.
February 26
23:06 mark: Attempted Ubuntu install on ms2, but Ubuntu doesn't fully support systems with > 16 disks. Awaiting their fix.
23:06 mark: Downpreffed paths _2828_7473_ and _2828_4637_
22:37 brion: installed swap-watchdog on pdf1
22:25 brion: tom rebooting pdf1 by LOM
22:20 brion: swapping spike on pdf1
21:54 mark: Set up link aggregation over two GigE links on ms3
21:49 Rob: replaced the favicon file for wikibooks and synced out per bug 17049 New favicon for Wikibooks projects
21:36 Rob: setup further log rotation on pdf1
21:31 brion: enabling Collection on sourceswiki ("old wikisource"), was forgotten the other day
21:15 Rob: setup log rotation of mw-serve.log on pdf1
21:02 mark: Upgraded iLOM/BIOS on ms2
20:32 Rob: restarted apache on srv193, srv197, srv201, srv217, srv223
20:28 Rob: restarted apache on srv62, srv78, srv100, srv133, srv170, srv173, srv175, srv190
20:18 Rob: restarted apache on srv135 and srv163 due to segfaults
20:10 brion: fixed up cache clear on pdf1. note we need to set up log rotation
19:44 Rob: nope, didnt break it, it worked. Updated InitialiseSettings per bug 17295 Please allow sidebar links to be localisable for Wikimania 2009wiki
19:44 Rob: pushing a change per bug 17295 that could very well crash wikimania2009 wiki, lets find out! =]
19:41 brion: merging r47836 to restrict 'pdf version' link in restricted mode
19:37 mark: Moved ms2 to internal vlan with new ips and dns entries
19:28 brion: testing rollout of PDF generation extension on en.wikipedia.org; with collection portal limited to logged-in users only
19:19 Rob: srv31 commented out of mediawiki-installation node group until its back online
19:16 Rob: updated mw-lib on pdf1, restarted pdf generation service
19:16 Rob: pdf generation offline for update
18:17 brion: srv31 appears to be a bit borked again. sigh.
18:05 Rob: Added wgRestrictDisplayTitle to InitialiseSettings to support bug 17307 Remove {{DISPLAYTITLE:}} restrictions for wm2009wiki. Also removed the wgAllowDisplayTitle as it was an old testwiki flag that is no longer used.
17:49 Rob: Updated InitialiseSettings to support bug 17361 Enable DynamicPageList for Wikimania2009wiki
16:44 mark: Brought BGP session to 2828 back up
14:59 Rob: updated InitialiseSettings for 17201 Enable the ability for crats on the Arbcom private wiki (English Wikipedia) to also remove crat and sysop
14:47 Rob: fixed change in InitialiseSettings for bug 17388
14:42 Rob: rolling back change because I changed the wrong entry.
14:29 Rob: Implemented changes to InitialiseSettings for bug 17388 Allow bureaucrats to revoke sysop status on UK chapter website
11:13 mark: Shut down BGP session to 2828, many reports of partial reachability, seemingly due to a broken link in a multipath
06:12 Tim: running gearmanRefreshLinks.php on hume for all wikis with 20 worker threads
05:24 Tim: on hume: deleted the unnecessarily aggressive default configuration from /etc/ufw to allow ufw to be enabled without immediately taking the server off the net or rate limiting all incoming connections to 3 per minute. Drop incoming connections to gearman.
February 25
22:21 Andrew: Deploying AbuseFilter to nowiki.
18:58 mark: Installed ms3 on 2 of the 4 bootable disk drives (sda and sdi), on a 80 GB RAID-1 root partition, leaving the rest of those 250 GB drives free, as well as the other 46 drives
18:24 Rob: srv215 deployed into service
18:16 Rob: installed and setting up srv215
18:08 brion: merging r47809 live -- XSS fix for Collection extension
18:01 brion: temporarily halting apache on srv78, which seems to have some borkage with nfs mounts and nis
17:55 brion: srv78: ls: cannot access /mnt/upload5/wikipedia: No such file or directory
17:53 Rob: rename auth2 to sockpuppet in dns, pushed changes
16:55 Rob: fixed IP allocation for ms4 and ms3
16:55 Rob: reset ms4 by accident due to IP misassignment on service processors for ms4 and ms3
05:29 Tim: updated ApiQueryLogEvents.php to r47781
04:21 Tim: killed long-running ApiQueryLogEvents queries on db4 (12000 seconds)
01:11 Andrew: Deployed AbuseFilter on dewiki per request of DaB
01:08 Andrew: adding AbuseFilter tables for dewiki
00:03 brion: running mw-serve cache cleaning on pdf1; cron job was borked by missing log file
February 24
22:46 Andrew: Deployed AbuseFilter on metawiki.
22:43 Andrew: Adding abuse filter tables to metawiki.
22:30 brion: syncing merge of r47769 to fix regression in page history
22:22 brion: fix to disable slow tag search has had a side effect breaking page history. Should be tidied up in a few minutes.
22:15 brion: merging r47767 live to disable $wgUseTagFilter
20:37 Rob: updating dns for wikimedia.cz
20:32 brion: turning Collection/PDF on for all wikisources
19:47 mark: Shutting down Apache and unmounting /home on srv32, for puppet testing
19:03 Rob: updated wikibooks logo stuff per bug # 17034
17:13 brion: just noting that TeliaSonera has a planned maintenance which may affect KNAMS connectivity for a few minutes around 2009-02-25 05:00 UTC.
16:51 brion: restarting some dump threads on srv31
01:55 river: ms1 is now replicating to ms4 instead of ms2
00:06 Rob: reverted cs.wikimedia.org and cz.wikimedia.org redirects for danny
February 23
22:56 RobH_: srv136 reinstalled and redeployed as apache
22:32 mark: Installed puppet (test install) and thereby automatically gmond as aggregator on srv33
22:27 mark: Installed puppet (test install) and thereby gmond on srv32
21:16 domas: adler has disk with media errors (ID:5, 6th disk in array):
- needs cannibalized samuel, disk replacement, and ubuntu install on raid10
19:04 Rob: srv136 back from repairs, reinstalling as apache server
18:44 Rob: srv217 not running apache, synced and restarted
18:29 Rob: srv33 reinstalled to ubuntu and deployed as apache server
18:24 Rob: srv32 reinstalled to ubuntu and deployed as apache server
17:55 Rob: reinstalling srv32 to ubuntu
17:38 Rob: resynced and restarted apache on srv32, srv33, srv34
17:32 Rob: srv31 powered back up
17:25 Rob: found a breaker flip in the DC, affects srv31-srv34
13:40 domas: oh, btw folks, kudos on perfect web2.0 engineering, now morebots complains when message is longer than 140 bytes, and we end up without our microblogging syndication
13:39 domas: added "su -m 'www-data' -c 'find /opt/mwlib/var/cache/ -mindepth 3 -mtime +1 -delete'" to pdf1 crontab, does anyone actually look after this service?
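A rough Python sketch of what the crontab line above does, for readers less familiar with find(1). It is an illustration only, not what runs on pdf1: the function name `clean_cache` and its parameters are made up here, and it handles regular files only, whereas `find ... -delete` would also remove matching empty directories. Note that `-mtime +1` ignores fractional days, so it matches files last modified at least two full days ago.

```python
import os
import time
from pathlib import Path


def clean_cache(root, min_depth=3, mtime_plus=1, now=None):
    """Delete old files at least `min_depth` levels below `root`.

    Mirrors `find root -mindepth 3 -mtime +1 -delete` for regular files:
    a file directly inside `root` has depth 1, and the age in days is
    truncated before being compared with `mtime_plus`.
    Returns the sorted list of deleted paths.
    """
    now = time.time() if now is None else now
    root = Path(root)
    deleted = []
    for dirpath, _dirnames, filenames in os.walk(root):
        dir_depth = len(Path(dirpath).relative_to(root).parts)
        if dir_depth + 1 < min_depth:
            continue  # files here are too shallow to be touched
        for name in filenames:
            path = Path(dirpath) / name
            age_days = int((now - path.lstat().st_mtime) // 86400)
            if age_days > mtime_plus:
                path.unlink()
                deleted.append(str(path))
    return sorted(deleted)
```

The `-mindepth 3` guard is what keeps the top two directory levels of the cache intact while their contents are pruned.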
12:57 Tim: deployed r47704, now command line scripts don't access /home anymore
11:37 Tim: switched archive directory over to /mnt/upload5, starting another rsync. Some files will be missing until the rsync is done
10:07 Tim: moved all job runners from the previous ad hoc script to the new wikimedia-job-runner package
06:25 Tim: moved the nagios plugins for fedora from /home/nagios to /h/w/common/nagios-fedora-plugins
05:21 Tim: started udp2log on db20, MW UDP logs were dead
05:19 Tim: killed errant jobs loop scripts still running on fedora servers
04:36 Tim: fixed the log directory for /etc/cron.d/mw-central-notice, killed the process that was in a tight loop trying to write to a stale NFS file handle
04:28 Tim: finished moving ExtensionDistributor working copy
04:14 Tim: moving ExtensionDistributor working directory from /home to /mnt/upload5
04:00 Tim: private/archive/wikipedia was in fact not migrated, but an initial rsync was done. I will do a second rsync now.
03:42 Tim: rsync done, uploads re-enabled, b/c symlinks set up
03:37 Tim: doing rsync
03:31 Tim: temporarily disabled file uploads on all private wikis, for migration to ms1
03:10 Tim: confirmed with maintenance/getRealUploadDir.php that all wikis except the private wikis have an upload directory which symlinks to /mnt/upload5. Changed $wgUploadDirectory in InitialiseSettings.php accordingly. Deleted some ancient commented-out code from CommonSettings.php.
02:50 Tim: same for commons ForeignDBViaLBRepo directory, ScanSet directory, CentralNotice directory,
02:44 Tim: fixed CommonSettings.php location of deleted images, upload3 -> upload5, appears to have been moved already
February 21
19:49 mark: Installed gmond on eiximenis
19:02 domas: db26 lacks 8g of ram :)
19:00 mark: Restarted stuck apache on srv217
17:26 mark: Started apache on srv218-221
17:24 mark: Restarted stuck apache on srv217
17:07 mark: Squid/kernel upgrade complete
16:46 mark: Increased max-connections per upload squid to ms1 to 100
15:58 mark: Running automated upgrade/reboot of squid and kernel on sq43-47
15:58 mark: Upgraded squid and kernel on sq41-42, sq48-50, and rebooted
15:44 mark: Upgraded squid and kernel on sq36-40, and rebooted
12:55 river: fixed reverse dns entries for ms3/ms4, which had got swapped somehow
11:55 Tim: re-enabled ExtensionDistributor
11:16 Tim: removed syslog.0 and messages.0 on srv170 and srv176, they had critical disk free on /
03:25 Tim: started apache on the image scaling servers
02:51 brion: ran sync-common on srv199 while i'm at it
02:48 brion: zeroing out stupid giant syslog files on srv199
02:46 brion: srv199 is out of disk space
02:46 brion: copying hacked-up copies of InitialiseSettings/CommonSettings back to /home so the changes aren't lost this time
02:22 mark: db20 back up, for reals
02:19 mark: Rebooting db20 with upgraded RAID controller firmware
02:13 domas: flashing BIOS helped
02:13 mark: db20 up!
02:03 brion: services on bart (secure, planet) are temporarily offline while server is poked at
01:50 brion: seeing pages, yay
01:49 brion: running apache2ctl start or apachectl start for various apaches
01:47 domas: I FOUND HOW TO REVIVE APACHES
01:46 brion: think i killed em, now trying to restart apache procs
01:43 brion: poking to see if we can restart apaches...
01:42 brion: syncing fixed InitialiseSettings/CommonSettings to apaches
01:14 brion: and flyingparchment
01:14 brion: domas and mark are attempting to restart the NFS server, but aren't mentioning any details in the public channel or log
00:52 domas:
00:52 mark: db20 in trouble
00:39 mark: @brion you don't need to wake up
00:36 domas: disabled 2006 fundraising cronjob on amane :-)
February 20
23:31 Rob: upgraded squid and kernel on sq34-sq36
23:12 Rob: upgraded kernel and squid on sq31-sq33, redeployed and online
23:08 brion: updating CentralNotice for improved test script (plus i18n update)
22:54 Rob: upgraded kernel + squid on sq28-sq30
22:29 Rob: completed upgrades to sq25-sq27
22:12 Rob: upgrading kernel and squid versions on sq25-sq27 (if i crash the site, i apologize in advance)
22:08 Rob: upgraded kernel and squid on sq24
21:59 river: added current patches to ms4, set zil_disable=1 and rebooted
21:30 brion: srv31 seems to be down, so no dump activity
21:08 brion: scapping to update FlaggedRevs to r47588 (fixing fatal err)
21:01 Rob: updated kernel and squid on sq23
20:58 Rob: updated kernel and squid on sq22
20:36 Rob: updated kernel and squid on sq20 and sq21
20:25 domas: some apaches in crashloop like this:
20:09 Rob: restarted apache on srv74
20:03 Rob: upgraded kernel and squid on sq19
19:50 Rob: upgraded kernel + squid on sq18
19:34 Rob: upgraded kernel + squid on sq17
19:19 brion: updating FlaggedRevs to r47574
18:16 river: set zil_disable on ms1 to improve nfs write performance
18:15 mark: Raised max-conns to 50
18:03 mark: Cut down max conns even more (25) for pmtpa upload backend squids
17:40 mark: Limited maximum connections to backend (ms1) to 50 per squid on upload squids, 1000 per squid on text
16:17 domas: plenty of fedoras had futex deadlocks
16:16 Rob: upgraded kernel and squid on sq14 and sq15
15:49 Rob: updated squid and kernel on sq13, rebooted, back online
15:26 Rob: upgraded squid and kernel on sq9-sq12 (not all at the same time)
14:59 Rob: upgraded squid and kernel on sq5, sq6, sq7, sq8
14:51 Rob: upgraded squid and kernel on sq2-sq4
14:50 Tim: updated ContactPage extension, will deploy it on nlwiki shortly
10:52 mark: Reduced cache_mem from 3000 to 2500 for pmtpa upload backend squids - no restart, will take effect with the 2.7 upgrade later today
10:00 mark: Started backend squid on sq26, it was gone
February 19
23:54 brion: updating AbuseFilter to r47523 :P
23:51 brion: updating AbuseFilter to r47522
23:40 brion: updating FlaggedRevs to r47522
23:39 Andrew: Enabled Abuse Filter on MediaWiki.org
23:17 mark: Stopped experimental varnish on sq1, please keep Squid off as well
22:52 Andrew: Allowed bureaucrats to remove 'sysop' right on testwiki.
22:42 brion: updating includes/api to r47522 to fix a couple regressions
22:15 mark: Started an experimental varnish instance on sq1 port 80
21:22 mark: Stopped Squids on sq1
14:23 Tim: removing memcached from srv154,srv155,srv157,srv158,srv169,srv170
14:18 Tim: started memcached on srv190-199
14:06 mark: Added "vport=80" to the http_host directive on all backend squids, to force Squid to use the default HTTP port, 80
10:53 domas: livemerged r47483 (backlinks cache read explicit order, :( )
07:56 Tim: restarted job runners with 4 processes per server instead of 1. Db2 is now heavily loaded, apparently due to the SELECT queries involved in the large numbers of unnecessary refreshLinks2 jobs that were queued before r47478 went live. But they should be done in a few hours at this rate.
05:00 Brion: enabling Collection on fr, pl, nl, pt, es, simple Wikipedias
02:12 Tim: deploying r47478
February 18
22:41 Andrew: morebots back up, now logs to identi.ca with the name wikimediatech
22:38 tomaszf: installed srv208 with Ubuntu 8.10.1 and installed app server software.
22:12 domas: Andrew killed morebots. let's see how he fixes it... :)
21:59 Rob: PDF creation moved to pdf1
21:58 Rob: changed pdf generation from erzurumi to pdf1, testing.
19:21 Rob: srv255 changed to pdf1 and moved, drac setup along with dns resolution
19:19 brion: scapping
19:18 brion: svn up'ing test to r47457
18:37 Rob: reinstalling srv209 due to dhcp misconfiguration making it think it was srv208
15:13 mark: Restarted all upload frontend squids to get rid of the memleaking
14:20 mark: Blocked all non-GET/HEAD HTTP methods in requests to upload frontend squids
12:46 Tim: put r47447 live for temporary proposed fix of bug 17552
08:38 Tim: svn up r47434 to fix Special:BrokenRedirects
08:04 Tim: cleaned up binlogs on db2
06:33 brion: note there's a live hack in api categorymembers query which may be breaking lookups
05:54 Tim: set up bugzilla attachment_base, pointing to the new domain, and set allow_attachment_display=on
05:51 brion: disabling $wgTorTagChanges in CommonSettings after the ext gets loaded (needs fix for testwiki)
05:46 brion: syncing reverted expr.php w/o bc stuff
05:25 brion: syncing extensions/FlaggedRevs/specialpages/OldReviewedPages_body.php fix
05:24 brion: syncing fix to Expr.php for bcpow() error
05:16 brion: syncing fix to extensions/ParserFunctions/Expr.php
04:59 brion: starting scap process...
04:52 brion: svn up'ing test to r47418
04:45 brion: svn up'd test to 47417
04:30 brion: removing editor, reviewer from add/remove for all users in test. that was an old test not needed anymore :D
03:42 brion: rc tags tables created sitewide; should be safe to scap and check for final problems if we're brave
03:35 brion: applying patch-change-tags to all wikis
02:57 brion: ran patch-change_tag.sql on testwiki
02:52 brion: full svn up'ing for test wiki
02:06 brion: worked around breakage with pager base class incompat with latest codereview :P
01:52 brion: svn up'ing CodeReview to aid in completing code review ;)
February 17
23:58 Rob: srv217-srv223 installed and online as apache servers. Updated dsh groups and nagios, as well as pybal
23:24 Rob: installed OS on srv217-srv223, moving on to package installation.
21:12 Rob: reinstalling srv209, which thought it was srv208. silly server. srv208 has not been installed, gave to tomasz to check against setup checklist.
21:05 Rob: actually, srv209 installed as 208, bad dhcp entry. Fixing
21:04 Rob: pulling srv208 and srv209 for quick reboots, their drac ips are wrong.
21:04 Rob: racked srv217-223 (also racked srv224/225 but no power yet)
18:30 brion: starting a batch run of update-special-pages-small just to ensure it actually works
18:25 brion: fixed hardcoded /usr/local path for PHP and use of obsolete /etc/cluster in update-special-pages and update-special-pages-small; removing misleading log files (bugzilla:17534)
03:19 Tim: removed live hack updating MW_DIFF_VERSION, changed on December 30 and the cache expiry is a week. Should not cause a significant amount of load.
03:01 Tim: removed live hacks from extension/Cite, updated to r47350.
01:49 Tim: deleting all enotif jobs from the job queue, there is still a huge backlog
February 16
16:46 mark: Did emergency rollback of squid 2.7.6 to squid 2.6.21 because of incompatible HTTP Host: header
16:21 Rob: stopped upgrades, sq36 completed before stop
16:17 Rob: performing upgrades to sq35-sq38 (not depooling in pybal, letting pybal handle that automatically)
16:16 Rob: performed dist-upgrade on sq31-34
15:35 Rob: depooled sq31-sq34 for upgrade
08:12 Tim: patched in r47309, Article.php tweak
05:00 Tim: made runJobs.php log to UDP instead of via stdout and NFS
04:53 Tim: fixed incorrect host keys in /etc/ssh/ssh_known_hosts for srv38, srv39 and srv77
04:13 Tim: removing all refreshLinks2 jobs from the job queue, duplicate removal is broken so to clear the backlog it's better to just run maintenance/refreshLinks.php
February 15
21:59 mark: Experimentally blocked non GET/HEAD HTTP methods on sq3 frontend squid
16:15 mark: Upgraded PyBal on lvs2 - others will follow
13:11 domas: db23 has multiple MCEs for same dimm logged:
12:38 domas: in wikistats, placed older than 10 days files into ./archive/yyyy/mm/ - maybe will make flack crash less :))
11:56 mark: Doing Squid memleak searching on sq1 with valgrind, pooled with weight 1 in LVS
03:09 Andrew: CentralNotice still not working properly, and when we tried to set it to testwiki-only, it never came up. Left it on testwiki only for the time being, until somebody who knows CentralNotice can take a look at it.
02:21 Tim: fixed permissions on the rest of the logs in /home/wikipedia/logs/norotate (fixes centralnotice)
February 14
19:19 Az1568_: re-enabled CentralNotice on testwiki to try and find the problem (we've had this before, but fixed it somehow...possibly with a regen? See November 16th log.)
18:34 domas: filed a bug at
- could use some Canonical escalation too
18:26 domas: same affected srv47 - this is related to switching locking to fcntl() - this drives apparmor crazy
17:47 domas: srv178 kernel memleaked few gigs. blame: apparmor
14:34 domas: srv215 very much dead, doesn't show vitality signs even after serveraction hardreset
14:28 domas: correction, srv208.mgmt is pointing to uninstalled box
14:27 domas: DRAC serial on all new boxes is ttyS1 which is not in securetty
14:24 domas: srv209.mgmt is actually srv208's SP, and srv208.mgmt is pointing to dead box
14:15 domas: srv209,215 down?
13:43 domas: installing php5-apc-3.0.19-1wm2 (no more futexes) on all ubuntu appservers.
10:02 Andrew: Reports that CentralNotice broke on all wikis, displaying just the message name in angle brackets, even though the message existed on meta. I have no idea what caused it and I couldn't find anybody who knows anything about it, so I disabled the notice itself on Special:CentralNotice on meta. Somebody who knows what they're doing should probably look into it later.
February 13
22:10 mark: esams squid upgrade complete
21:05 RobH: deployed srv207-srv216 in apaches cluster
20:34 RobH: added new servers to nagios and restarted it
20:15 RobH: setup all node groups, ganglia, apache, so on for srv199-srv206 and added into rotation
19:38 mark: Upgrading esams squids to 2.7.6
18:36 mark: Upgraded squid on sq1 to 2.7.6 and rebooted the box
18:03 mark: Memory leak issues on the upload frontend squids, which started in November
18:01 RobH: sq13 back online, seems there is a memory leak, go mark for finding =]
17:54 RobH: lomaria install done for domas
17:49 RobH: rebooting sq13 due to it failing out in ganglia, OOM error evident.
17:48 RobH: reinstalling lomaria per domas request
17:37 RobH: sq8 was unresponsive to console, locked up, rebooted, cleaned cache, and bringing back online
17:34 RobH: srv38 and srv39 back in rotation
17:23 RobH: srv38 and srv39 reinstalled, installing packages now
16:57 RobH: reinstalling srv38/srv39
16:57 RobH: srv80 reinstalled as ubuntu apache and back in rotation
16:31 RobH: srv79 back in rotation
16:21 RobH: srv79 reinstalled, installing packages and ganglia
16:12 RobH: reinstalling srv79
16:00 RobH: ganglia installed on srv77, back in rotation
15:55 RobH: srv77 redeployed as ubuntu apache server
15:48 RobH: reinstalling srv77 to ubuntu
February 12
23:59 brion: adding 'helppage' to ui-content messages on commons per bugzilla:5925
23:01 RobH: racked and setup drac for srv208-srv216
21:20 mark: Killed blocked apache processes on srv180, and restarted apache
21:19 mark: Killed blocked apache processes on srv172, and restarted apache
21:07 brion: fixed ownership on log files for updateSpecialPages cronjob, which likely is what broke it
20:28 mark: Upgraded experimental squid 2.7.5 on knsq1 to squid 2.7.6
20:00 brion: fixed typo which broke access to revision deletion log for oversighters. tx to aaron for the spot :D
19:45 mark: Replaced "2 cpu apaches" group aggregator srv32 by srv35
18:55 RobH: racked, wired, and remote management setup for srv199-srv207
09:51 domas: added srv190-srv198 to apaches dsh group, as they seem to be alive and kicking
09:48 domas: changed weights for srv190-srv198 80->100 (to account for 1.85->2.5 ghz cpu step )
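As a quick check of the weight change above, here is the arithmetic under the simplifying assumption that capacity scales linearly with clock speed. The helper name `scaled_weight` is made up for this note, and clock speed is only a rough proxy for real throughput.

```python
def scaled_weight(old_weight, old_ghz, new_ghz):
    """Scale a load-balancer weight in proportion to CPU clock speed."""
    return old_weight * new_ghz / old_ghz


# Weight 80 at 1.85 GHz, scaled to 2.5 GHz.
proportional = scaled_weight(80, 1.85, 2.5)
print(round(proportional, 1))  # 108.1
```

Strict proportionality would give about 108, so the chosen value of 100 sits slightly on the conservative side.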
00:29 brion: running updateRestrictions on wikis to clean up remaining funky restrictions entries per bugzilla:16846
00:22 Tim: restarted apache on srv172
February 11
23:23 mark: Pooled srv190-198
23:23 Tim: re-enabling search suggestions
23:19 mark: Installed Ganglia on srv190-198
23:17 mark: Installed MediaWiki application server packages on srv190-198
23:02 mark: Added srv190-198 to mediawiki_installation node_group (not any others)
22:55 mark: Ran dist-upgrade && reboot on srv190-198
22:46 mark: OS installed on srv190-198
22:19 RobH: racked and setup drac on srv195-srv198
22:11 RobH: racked and setup drac on srv192, srv193, srv194
22:00 RobH: racked and setup drac on srv190, srv191
21:24 brion: putting ixia back in rotation, it's caught up
20:05 brion: depooling ixia while it catches up
20:05 brion: ixia lagged 8810 secs
20:00 brion: ixia replication is broken -- causing contribs lag on itwiki
19:19 RobH: setup msw-a5-sdtpa like 30 minutes ago, oops ;]
19:00 mark: Added srv190-225 to DNS & DHCP
18:55 mark: set up RANCID for asw-a4-sdtpa and asw-a5-sdtpa
18:54 brion: disabled srv38,39,77,79,80 in lvs3 pybal config to ensure they don't go back into service accidentally until fixed up
18:37 brion: stopping apache on those bad machines for the moment
18:35 brion: srv38, 39, 77, 79, and 80 appear to have been prematurely put into apaches pool, running old version of PHP. need to be halted and upgraded
17:26 domas: restarted apache on srv154 after the deadlock in apc
16:04 Tim: disabled checkers.php hack, using mwsuggest.js hack instead
15:52 Tim: emergency optimisation: disabled search suggest via checkers.php
15:41 domas: srv159 restarted as proper apache, not -DSCALER
09:02 domas: moved morebots to ~morebots@wikitech.wikimedia, startup line in rc.local :)
07:05 Tim: running maintenance/fixBug17442.php
06:56 Tim: restarted job runners
04:31 Tim: upgraded bugzilla to 3.0.8 with cvs up, and copied in the docs directory from the 3.0.8 tarball
03:31 Tim: gave myself an account on isidore, cleaned up some crap in /srv/org/wikimedia to /srv/org/wikimedia/backup
02:58 Tim: apt-get upgrade on isidore
February 10
23:47 mark: Moved upload esams LVS from mint to hawthorn
23:41 mark: Installed a specially compiled LVS Feisty kernel on hawthorn (running Hardy) & rebooted
22:33 RobH: updated mwlib on erzurumi per brion
22:25 RobH: some resets and such on searchidx1 to get ssh working. system is very sluggish.
19:28 brion: wikitech server crashed; CPU pegged and OOM. rob rebooted it, yay
02:46 Tim: running maintenance/fixBug17300.php to create missing redirect table entries
01:18 Tim: reverted PP caching patch
01:14 Tim: re-enabled search suggestions
February 9
23:13 domas: grunt session finished
23:10 domas: brought up srv80 from hibernation and made it work.
22:53 domas: added srv61 too
22:23 domas: added srv144 and srv147 to duty, added ganglia stuff too
22:01 domas: started appserver work on srv77,srv79
21:54 domas: started srv35,38,49 as appservers, restarted deadlocked srv49 processes
16:14 mark: Moved upload LVS back from hawthorn to mint - even an optimized 2.6.24 kernel is not fast enough to serve upload LVS
16:03 Tim: disabled search suggest as an emergency optimisation measure
16:02 mark: Rebooted hawthorn with an LVS optimized kernel, moved upload LVS back to it
15:53 mark: Moved upload esams LVS back to mint
15:37 mark: Moved upload.esams LVS from mint to hawthorn
15:28 mark: Reinstalled server hawthorn with Hardy 8.04
13:55 domas: fixed ganglia group for srv159 (it is scaler, not appserv)
13:51 domas: brought srv182 up
13:32 domas: repooled srv104 and srv105, after few months of vacation
13:20 domas: killed few orphaned tidy processes that were very very busy since Feb1
13:13 domas: heeheee, extorted this: [15:11] so, srv77,79,80, rose, coronelli and maurus could be converted to apaches
12:36 Tim: trying apc.localcache=1 on srv176
04:27 Tim: patching in r46936
03:48 Tim: attempting to reproduce APC lock contention on srv188
February 8
22:43 brion: may or may not have fixed that -- log file was unwritable. hard to test the command since 'su' bitches about apache not being loginabble on hume :P
22:39 brion: investigating why centralnotice update is still broken. getting fatal php errors wtf?
20:17 domas: we were hitting APC lock contention after some CPU peak. Dear Ops Team, please upgrade to APC with localcache support. :)))))
February 7
22:49 domas: db17 came up, but it crashed with different symptoms than other boxes, and it was running 2.6.28.1 kernel. might be previous hardware problems resurfacing
22:47 brion: chmod'ing centralnotice JS output on ms1 so batch processes running as 'apache' user can actually update them. hadn't been getting updated since february 5, leading to complaints when the swedes updated a translation on the steward banner
21:23 domas: db17 down
February 6
12:33 brion: stopped that process since it was taking a while and just saved it as an hourly cronjob. :) log to /opt/mwlib/var/log/cache-cleaning
12:28 brion: running mw-serve cache cleanup for files older than 24h
February 5
18:19 brion: put ulimit back with -v 1024000 that's better :D
18:18 brion: removed the ulimit; was unable to reach server with it in place
18:15 brion: hacked mw-serve to ulimit -v 102400 on erzurumi, see if this helps with the leaks for now
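The two limits in the entries above differ by a factor of ten. Assuming the shell's usual convention that `ulimit -v` takes its value in units of 1024 bytes, the conversion looks like this (the helper name is made up for this note):

```python
def ulimit_v_to_mib(kib):
    """Convert a `ulimit -v` value (1024-byte units) to MiB."""
    return kib / 1024


print(ulimit_v_to_mib(102400))   # 100.0  (first attempt: about 100 MiB)
print(ulimit_v_to_mib(1024000))  # 1000.0 (second attempt: about 1000 MiB)
```

A cap of roughly 100 MiB of virtual memory is small for a server process, which would be consistent with the server becoming unreachable until the limit was raised.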
16:56 domas: rebooted erzurumi, placed swap-watchdog into rc.local
16:03 mark: Added Qatar (634) to the list of esams countries
01:27 Tim: migrated arzwiki upload directory from amane to ms1
01:00 Tim: fixed arzwiki upload directory permissions
00:56 Tim: moved most cron jobs from admin user cron tabs to /etc/cron.d on hume
February 4
22:33 tomaszf: Adding cron for torblock under tfinc@hume
22:20 tomaszf: ran loadExitNodes() to update tor block list
18:36 brion: running TorBlock/loadExitNodes.php
17:25 brion: stripped BOM from en.planet config.ini; re-running.
17:24 brion_: attempting to run planet update for en.planet manually..... there's a config error
16:30 domas: stealing db27 for moar tests
February 3
13:05 mark: Remote-hands replaced some cables, fuchsia is back up but idling
06:57 Tim: doing some schema changes on the otrs database. Some fields should be blobs and are text instead, perhaps due to a previous 4.0 -> 5.0 MySQL upgrade
01:48 Tim: added blob_tracking table to ukwikimedia
01:42 Tim: repooled db3 and db4
00:34 mark: Moved traffic back
00:28 mark: Shutdown switchport of fuchsia in order to prevent it from interfering with mint (which took up text LVS as well as upload)
00:20 mark: Moved European traffic to pmtpa - text LVS unreachable
February 2
23:54 domas: took out db29 for some testing
22:07 mark: Modified Exim configuration on williams to not discard but deliver spam-recognized messages to OTRS with an X-OTRS-Queue: Junk header, as well as SpamAssassin headers
21:35 brion: reverting change to Cite_body.php
21:28 brion: caching for cite refs is known to cause problems with links randomly replacing with other links; likely strip marker problem. andrew is investigating
19:31 domas: merged in Andrew's Cite cache to live site
16:47 brion-sick: syncing update to Collection to do more efficient sidebar lookups
16:18 brion-sick: large spike in text backend service times
16:15 brion-sick: secure.wikimedia.org is returning 503 Service Temporarily Unavailable
08:11 Tim: removing ancient static HTML dump from srv31
08:05 Tim: removed cluster13 and cluster14 from db.php, will watch exception.log for attempted connections
08:02 Tim: removed srv130 from LVS and the apaches node group, not accessible by ssh but still serving pages
07:56 Tim: find /home/wikipedia/logs -size 0 -delete
07:43 Tim: re-added db22 to s1 rotation, no explanation for its removal in server admin log
06:39 Tim: dropped the otrs_test database
06:38 Tim: moved the OTRS database from otrs_real back to otrs. Updated exim4 config on mchenry
04:23 Tim: db10's relay log was corrupted, did a flush slave/change master
01:10 Tim: started mysqld on db23, doing recovery
00:59 Tim: rebooted db23
00:56 Tim: db23 down, depooled
00:05 Tim: adjusted innodb configuration on db10, restarted, starting replication
February 1
23:40 Tim: OTRS recovery script done
22:13 brion: updating rowikibooks logo bugzilla:17273 (note the log bot is down again)
21:25 Tim: running script to copy deleted OTRS data from db10
20:40 mark: Lily was overloaded due to the long downtime of mchenry, stalling all mailing lists deliveries
20:39 mark: Granted SELECT access to mchenry and williams for database otrs_real - they've been giving temp rejects for hours
11:24 Tim: mysqld on db10 crashes when it tries to run the current replicated query. Probably needs a resync. Set --skip-slave-start
10:05 Tim: updated OTRS DB name on mchenry
09:53 Tim: reading in SQL backup
09:33 Tim: moving the otrs database to otrs_real to allow easier binlog import
03:52 Tim: done 1 and 2
03:10 Tim: recovery plan is as follows: 1. re-enable r/w web access, 2. compile a list of deleted IDs from the binlogs (confirmed that this is possible), 3. read in the pre-upgrade backup to a separate DB and execute binlogs to the appropriate point, 4. copy affected IDs from the backup to the live DB
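Steps 2 and 4 of the plan above can be sketched roughly as follows. This is only an illustration of the idea, not the script that was run: the statement shape matched by `DELETE_RE`, the table name `ticket`, and both function names are assumptions, since the log does not show the actual OTRS queries found in the binlogs.

```python
import re

# Hypothetical statement shape; the real statements in the binlogs may differ.
DELETE_RE = re.compile(
    r"DELETE\s+FROM\s+ticket\s+WHERE\s+id\s*=\s*(\d+)", re.IGNORECASE
)


def deleted_ids(binlog_text):
    """Step 2: collect the distinct IDs removed by DELETE statements."""
    return sorted({int(m.group(1)) for m in DELETE_RE.finditer(binlog_text)})


def rows_to_restore(backup_rows, live_ids, deleted):
    """Step 4: pick backup rows that were deleted and are absent from live.

    backup_rows maps ID -> row as found in the restored backup database.
    """
    return {
        i: backup_rows[i]
        for i in deleted
        if i in backup_rows and i not in live_ids
    }
```

Restricting the copy to IDs that are missing from the live database avoids overwriting rows that survived or were changed after the backup was taken.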
02:52 Tim: patched GenericAgent.pm to prevent ticket deletion
02:27 Tim: it seems some admin inserted a GenericAgent job called "temp1" at 09:46 with the effect of deleting all tickets older than 30 days. The binlogs show a duplicate "Valid" key, with one row setting it to 0 and the next setting it to 1, so it's possible the user set valid=0 in the UI but due to a bug in OTRS, the job was considered valid. The job appears to have been run first at 09:46, probably from the web, then regularly at 10 minute intervals, most likely due to the cron job on bart which was not deactivated. I've now removed the relevant crontab and revoked bart's OTRS permissions.
01:11 Tim: put an explanatory note on the OTRS login screen and deleted all sessions to send users there
00:38 Tim: revoked write access from the otrs mysql user, to prevent any further damage. Making a copy of the binlogs. The plan is to do forensics first and then recovery second.
January 31
18:17 mark: Following reports of OTRS rapidly deleting old tickets/emails every ~ 10 minutes, I disabled (set to invalid) all GenericAgent jobs pending investigation
15:43 mark: Set local_from_check = false in exim.conf on williams, to prevent Sender headers from being added (annoying for Outlook users)
07:11 Tim: converting OTRS database to proper UTF-8 (instead of UTF-8 in latin1 fields) using ~/fix-schema.php
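For context on "UTF-8 in latin1 fields": when UTF-8 bytes are stored in a column declared as latin1, each byte is later read back as a separate character. The sketch below shows the effect and its reversal on a single string. It is not what ~/fix-schema.php does (that script is not shown in the log), the function name is made up, and it uses Python's strict `latin-1` codec, whereas MySQL's latin1 is closer to Windows-1252.

```python
def fix_double_encoded(text):
    """Undo 'UTF-8 bytes read as latin-1' mojibake.

    Returns the input unchanged when it does not look double-encoded.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# "é" is the bytes C3 A9 in UTF-8; read as latin-1 they display as "Ã©".
garbled = "é".encode("utf-8").decode("latin-1")
print(garbled)                      # Ã©
print(fix_double_encoded(garbled))  # é
```

Strings that are already correct fail one of the two conversions and are returned as they are, so the function is safe to apply to mixed data in this simplified setting.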
01:30 brion: updating eswikibooks logo bugzilla:17078
00:55 brion: setting mswikibooks logo bugzilla:17263
00:53 brion: copied wikimedia favicon to blog.wikimedia.org bugzilla:17171
00:51 domas: lomaria needs reinstall, db24 and db30 are live in s2 duty
January 30
17:54 domas: *giggle*, booted up lomaria with SMP kernel
17:43 domas: lomaria kernel detects just one CPU (out of four)
17:26 domas: converted lomaria into dewiki-only server
14:20 Tim: Done with OTRS for now. Some bugs remain, particularly the missing ticket list in AgentTicketCustomer. I'll probably have to downgrade to 2.3.x tomorrow.
12:51 mark: Installed ganglia on williams
11:50 mark: Letting OTRS mail through to williams on mchenry
10:50 Tim: running upgrade of OTRS DB
10:44 mark: Removed all OTRS test copies in the queue of williams
10:42 mark: Deferring all OTRS mail on the queue of mchenry
10:30 mark: Put in a quick hack to forward misrouted OTRS mails from williams to bart
08:52 Tim: sent upgrade warning email to all OTRS agents
06:56 Tim: RCT should be finished now, no more connections are expected on cluster13 or 14. Current connection counts: 123943575, 295618929.
02:36 Tim: set up SSL on williams and switched ticket.wikimedia.org DNS to point to there
02:21 brion: set up new SSL cert for ticket.wikimedia.org; tim's poking at installing it
02:19 brion: updated password on tridge *cough*
01:43 brion: syncing update to Drafts with IE 7 fix (r46571 and style ver update)
00:16 brion: live-merging r46570 -- fixes to DB access in revisiondelete
January 29
22:55 mark: Did s/knams/esams/ on the selective AAAA answer config of ns0/ns1/ns2.wikimedia.org
22:47 mark: While messages are held in the queue on williams, use "mailq" to view the queue, and "exim -M <message-id>" to let an individual message through for testing
22:44 mark: SpamAssassin training from the OTRS Junk queue not yet setup
22:43 mark: Note: Exim on williams queries for mail addresses from the live OTRS database, not the test database
22:42 mark: Completed OTRS mail setup on williams. wikitech documentation updated in OTRS and Mail. OTRS mail is still copied to williams, and then held on the queue.
22:00 mark: Added db10 as secondary DB to query for Exim on mchenry
21:59 mark: Granted SELECT privileges on otrs.system_address to exim@williams on db9/db10
21:58 brion: enabling revision & log suppression for oversighters
21:12 brion: live-merging r46429 change to Special:Contributions -- stub marking fix
21:01 mark: Copying OTRS mail to williams, where it's automatically held in the queue without extra processing; useful for testing
21:00 mark: Installed SpamAssassin on williams for OTRS, copied training data from bart
20:14 recompressTracked.php finished
19:18 brion: aborted old enwiki dump so a fresh one can start, since that old history will never finish on the old system
19:17 brion: updated data dump scripts
17:57 brion: disabled 'mark patrolled' link for views without specific rcid param; but now it's back when we actually ask for it so actual rc/new pages patrol works again
17:54 brion: poking at patrol link live hack
17:40 brion: erzurumi is rebooted and serving out PDFs again. need to implement some resource limits...
17:35 brion: rebooting erzurumi via drac
17:32 brion: i hate the drac shell
17:24 brion: erzurumi appears to have been victim to a massive memory leak. seeing if we can reboot it
17:17 brion: poking at mw-serve on erzurumi; not responding
16:15 domas: livehacked out 'patrol' link on article views %)
04:02 Tim: added DNS entry for OTRS test
03:19 tomaszf: installed grosley
01:31 Tim: fixed srv76 and the wikimedia-task-appserver package
01:31 brion-busy: syncing r46513 -- fix for categoryfinder, update to fix for Collection
01:14 brion-busy: updating Collection ext -- compat issue with changed category
00:56 brion-busy: stopped apache on srv76 for the moment
00:55 brion-busy: srv76 doesn't have upload5 mounted
00:41 brion: live-hacking out a broken check in getDupeWarning() which broke uploading if you had a duplicate file
00:34 mark: DOM readouts on br1-knams:
br1-knams#sh optic 1
Port Temperature Tx Power Rx Power Tx Bias Current Monitor
+----+-----------+--------------+--------------+---------------+-------+
1/1 24.0078 C 000.7776 dBm 84.360 mA Disabled
1/2 N/A N/A N/A N/A
1/3 37.0000 C -003.4582 dBm -003.8111 dBm 58.470 mA Disabled
1/4 32.0234 C 000.4669 dBm 71.928 mA Disabled
00:22 Tim: synced nagios config
January 28
23:40 mark: s/knams/esams/ in DNS geobackend files
23:25 mark: Deployed fix in /lib/lsb/init-functions on sanger, mchenry, williams and lily which caused (amongst others) Exim reloads (-HUP) to be turned into a kill -TERM (Debian bug #434756)
23:15 mark: Set up basic mail system for OTRS on williams. Still incomplete and needs fine tuning and testing, spam checking is not yet implemented amongst other things.
22:30 mark: Restarted Exim on sanger, disappeared mysteriously
21:50 mark: Raised Dovecot max login process count from 128 to 1024
21:04 brion: merging reupload fixes: r46479, r46483, r46487
20:49 mark: Base OS install finished on williams.wikimedia.org
20:02 brion: merging r46472 (FlaggedRevs autopromote fix), r46464-46476 (feed RTL style fix, re-upload disabled field fix)
18:05 RobH: setup mail relay for wikimedia.cz for Danny and Co  ;]
08:43 domas: s3 replication switched from db1-bin.325:437169827 to db11-bin.026:79
08:35 domas: s2 rep switched from ixia-bin.150:119337662 to db13-bin.004:79
06:15 Tim: creating backup of db10 on storage2
04:29 brion: svn up'ing and scapping to r46424 consistently
04:22 brion: updating FlaggedRevs to r46422
04:17 brion: merging r46419, r46421 -- search display fixlets
03:51 brion: attempting scap again; tweaking DataCenter.ui.php since the scap syntax checks are whinging about the abstract static method o_O
03:40 brion: scapping to r46413
01:35 brion: svn up'ing to r46413 on test...
January 27
19:28 brion: syncing updates to Collection
19:04 brion: scapping update to AbuseFilter for test. updated its schema...
18:44 brion: db16 lagged 2188s
18:44 brion: restarting slave thread on db16. it got stopped with a lock wait timeout on a page_touched update (wtf?!)
18:43 brion: slave stopped on db16
17:41 mark: knsq1 Up and serving requests with squid 2.7.5
17:25 mark: Trying squid 2.7.5 on knsq1 - might be unstable in the mean time
17:22 mark: Reduced cache_mem on backend esams text squids from 3000 to 2500
16:23 RobH: srv76 had a failed hdd, replaced, reinstalled, and bringing back into rotation
16:18 RobH: srv146 was powered down (heat issue?), powered back up, synced and now in rotation.
16:09 RobH: srv139 didnt have apache running, synced and started
16:01 RobH: srv129 didnt have apache running, synced and started
15:59 RobH: sq11 back online, cleaned
15:40 RobH: srv126 back online. possible bad disk, if it crashes again, the disk needs replacement. (it went read only before, which seems to sometimes happen even when the disks are not bad.)
15:25 RobH: srv76 wont boot up, reinstalling.
15:12 RobH: srv130 coming back online, updated fstab, synced, putting it back in rotation.
15:05 RobH: moved ts-array4 to its dedicated ports, now its kate's problem ;]
14:49 Tim: restarted recompressTracked.php
14:33 Tim: henbane's disk has been full for 8 days due to donate-campaign.log, starting cleanup
14:18 Tim: killed recompressTracked.php
14:08 domas: removed unnecessary ms1 stat from CommonSettings.php. Recovery observed. (diff)
13:44 mark: CARP weight redistribution caused large load spike in upload backend request, causing ms1 overload, probably causing issues on apaches via NFS, etc etc...
13:29 mark: Lowered CARP weight from 10 to 5 for sq1-10.wikimedia.org, from 15 to 10 for sq11-15
08:20 Tim: depooled db3 and db4 to improve recompressTracked speed
07:09 Tim: There was a bug in recompressTracked.php which caused the last batch of orphans for any given wiki to be skipped. Re-running recompressTracked.php to repair it.
05:55 Tim: killed all job runners, changed the job-runners group to srv151-180, started job runners on those servers
05:50 Tim: migrated job runner scripts to ubuntu and started job runners on srv110-119
05:29 Tim: started job runner on srv89
02:13 brion: updating extensions/AbuseFilter/Views/AbuseFilterViewList.php (mysql 4 compat issue)
02:04 brion: installed release versions of mwlib on erzurumi and restarted. these should have updated localizations
01:48 brion: turning AbuseFilter on on test.... having some mysql 4.0 compat issues. poking
01:47 brion: srv31 seems very sad; slow/borked login?
01:39 brion: scapping to update AbuseFilter to current
01:27 brion: prepping testing of AbuseFilter on test.wikipedia
00:46 brion: enabling Collection also for de.wikisource per frank's req passed on from community
00:36 brion: adding NS_HELP to $wgCollectionArticleNamespaces
00:12 brion: Collection extension being enabled on dewiki
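Note on the 13:29 and 13:44 entries above: changing per-server hash weights reassigns part of the URL space to different backends, and every reassigned URL is a cache miss until the new backend has fetched it. The sketch below illustrates that effect only. It is not Squid's CARP algorithm; it swaps in weighted rendezvous hashing as a stand-in, and the sample URLs and the `pick_backend` and `weights_for` helpers are hypothetical.

```python
import hashlib
import math


def pick_backend(url, weights):
    """Return the backend with the highest weighted rendezvous score for url."""
    best, best_score = None, -math.inf
    for name, weight in weights.items():
        digest = hashlib.sha256(f"{name}|{url}".encode()).digest()
        # Use 52 bits of the hash to build a float strictly inside (0, 1).
        h = ((int.from_bytes(digest[:8], "big") >> 12) + 0.5) / 2**52
        # math.log(h) is negative, so the score is positive and grows with weight.
        score = -weight / math.log(h)
        if score > best_score:
            best, best_score = name, score
    return best


def weights_for(low, high):
    """Weights for sq1-10 (low) and sq11-15 (high), as named in the log entry."""
    weights = {f"sq{i}": low for i in range(1, 11)}
    weights.update({f"sq{i}": high for i in range(11, 16)})
    return weights


before = weights_for(10, 15)  # weights prior to the 13:29 change
after = weights_for(5, 10)    # weights after the 13:29 change

# Hypothetical sample of upload URLs.
urls = [f"/hypothetical/image/{i}.jpg" for i in range(5000)]
moved = sum(pick_backend(u, before) != pick_backend(u, after) for u in urls)
moved_fraction = moved / len(urls)
print(f"{moved_fraction:.1%} of sample URLs changed backend")
```

With these weights the nominal share of sq11-15 rises from 75/175 to 50/100, so a few percent of URLs change backend even in a scheme that minimises movement. Each of those is a fresh request to the media server behind the squids.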
January 26
22:39 RobH: UK Chapter wiki setup per
22:18 RobH: pushed apache changes for uk chapter wiki
22:13 RobH: updated dns for uk chapter wiki
19:29 brion: going to update Collection to current trunk in prep for further activation today
17:01 RobH: added support for the phone server to dns
January 25
12:18 mark: Announcing routes to AS16265 again
10:17 domas: our deadlocks are described in X4240 manuals. the fix is either disabling MSI or setting 'options forcedeth max_interrupt_work=15' in modprobe.conf. product notes
09:31 domas: db17 live, with 2.6.28.1 kernel
January 24
14:53 domas: db16 and db17 deadlocked:
11:43 domas: db17 stuck at nc/tar/kswapd:
10:36 domas: took out db4,db5,db8 for cloning
January 23
18:04 brion: putting load back on db3, it's up to date
17:49 brion: taking some load off db3 until it catches up
17:46 brion: also killed a WantedTemplatesPage::recache query which had been running for a day. that ain't sustainable. :P
17:44 brion: domas restarted morebots a few minutes ago :D
17:43 brion: syncing update to ApiQueryBacklinks.php with the USE INDEX that was added for this problem
17:41 brion: killing some stray backlinks queries
17:38 brion: ~1-hour lag on db3
morebots is broken/down? unable to edit
January 22
00:10 brion: whitelisting .ott (OpenDocument templates) for private-wiki uploads
January 21
20:25 RobH: some tinkering on http redirects, rollback
17:51 RobH: setup https for wikitech
17:23 RobH: setup wikitech to stream weekly backups to tridge
10:29 domas: db28 powered down because of temperature reading over threshold (45C???)
January 20
21:45 RobH: killed some run away processes on db9 that were killing bugzilla
21:44 brion: stuck long queries on bz again. got rob poking em
20:31 brion: putting $wgEnotifUseJobQ back for now. change postdates some of the spikes i'm seeing, but it'll be easier to not have to consider it
20:19 mark: Upgraded kernel to 2.6.24-22 on sq22
19:57 brion: disabling $wgEnotifUseJobQ since the lag is ungodly
17:58 JeLuF: db2 overloaded, error messages about unreachable DB server have been reported. Nearly all connections on DB2 are in status "Sleep"
17:21 JeLuF: srv154 is reachable again, current load average is 25, no obvious CPU consuming processes visible
17:10 JeLuF: srv154 went down. Replaced its memcached by srv144's memcached
03:02 brion: syncing InitialiseSettings -- reenabling CentralNotice which we'd taken temporarily out during the upload breakage
01:50 Tim: exim4 on lily died while I examined reports of breakage, restarted it
January 19
21:28 mark: Distribution upgrade on lily complete
21:27 mark: Letting mail through again on lily
21:01 JeLuF: Bugzilla didn't work. Some long-running (>3h) requests were locking some tables. Killed all long running jobs.
20:05 mark: Put mail delivery on hold on lily
20:03 mark: Upgrading lily (Mailing list server) to Ubuntu 8.04 Hardy
14:04 mark: Set a static ARP entry for 85.17.163.246 on csw1-esams to see if it helps with the inbound packet loss effects
January 18
20:25 mark: Cut outbound announcements to AS16265 to counter the inbound packet loss on that link
17:50 river: started copying ms1:/export/upload to ms4
00:21 Tim: restarted apache on srv158,srv177,srv106,srv66,srv109,srv140,srv86,srv90,srv133,srv172
00:19 Tim: cleaned up binlogs on db1
January 17
12:43 mark: Shut down transit link to 16265 due to intermittent packet loss
January 16
23:25 brion: activating Drafts extension on testwiki
21:18 brion: updating english/default wikibooks logo bugzilla:17034
19:50 brion: uncommented srv101 from apache nodelist
19:41 mark: Fixed authentication on srv101, and mounted /mnt/upload5
19:25 brion: srv101 is commented out of 'apaches' node group so didn't show up on my earlier sweep
19:23 brion: poking around, srv101 at least is missing upload5 mount still
January 15
21:16 brion: seems magically better now
20:48 brion: ok webserver7 started
20:43 brion: per mark's recommendation, retrying webserver7 now that we've reduced hit rate and are past peak...
20:28 brion: bumping styles back to apaches
20:25 brion: restarted w/ some old server config bits commented out
20:24 brion: tom recompiled lighty w/ the solaris bug patch. may or may not be workin' better, but still not throwing a lot of reqs through. checking config...
19:48 brion: trying webserver7 again to see if it's still doing the funk and if we can measure something useful
19:47 brion: we're gonna poke around but we're really not sure what the original problem was to begin with yet
19:39 brion: turning lighty back on, gonna poke it some more
19:31 brion: stopping lighty again. not sure what the hell is going on, but it seems not to respond to most requests
19:27 brion: image scalers are still doing wayyy under what they're supposed to, but they are churning some stuff out. not overloaded that i can see...
19:20 brion: seems to spawn its php-cgi's ok
19:19 brion: trying to stop lighty to poke at fastcgi again
19:15 brion: looks like ms1+lighty is successfully serving images, but failing to hit the scaling backends. possible fastcgi buggage
19:12 brion: started lighty on ms1 a bit ago. not really sure if it's configured right
19:00 brion: stopping it again. confirmed load spike still going on
18:58 brion: restarting webserver on ms1, see what happens
18:56 brion: apache load seems to have dropped back to normal
18:48 brion: switching stylepath back to upload (should be cached), seeing if that affects apache load
18:40 brion: switching $wgStylePath to apaches for the moment
18:39 brion: load dropping on ms1; ping time stabilizing also
18:38 RobH: sq14, sq15, sq16 back up and serving requests
18:38 brion: trying stopping/starting webserver on ms1
18:27 brion: nfs upload5 is not happy :(
18:27 brion: some sort of issues w/ media fileserver, we think, perhaps pressure due to some upload squid cache clearing?
18:23 RobH: sq14-sq16 offline, rebooting and cleaning cache
18:16 RobH: sq2, sq4, and sq10 were unresponsive and down. Restarted, cleaned cache, and brought back online.
04:32 Tim: increased squid max post size from 75MB to 110MB so that people can actually upload 100MB files as advertised in the media
January 14
19:21 mark: Lower preffed paths from 13030 that were learned at NYIIX
18:44 brion: updated wikitech to current SVN and rebuilt text search index for new server to fix short words
18:30 RobH: removed the sysop and bcrat add/remove from bcrat permissions for eswiki
18:22 RobH: added groups for eswiki again per
16:28 RobH: added rollbacker group per
January 13
23:32 Tim: fixed NRPE on db29
22:56 Tim: cleaned up binlogs on db1 and ixia
22:54 brion: poking WP alias on frwiki bugzilla:16887
21:11 RobH: setup ganglia on erzurumi
20:42 brion: setting all pdf generators to use the new server
20:40 brion: testing pdf gen on erzurumi on testwiki
20:35 RobH: setup erzurumi for dev testing
20:35 RobH: some random updates on server roles to clean it up
19:37 mark: Restored normal situation, with 14907 -> 43821 traffic downpreffed to HGTN to avoid peering network congestion
18:40 mark: Retracted outbound announcement to all AMS-IX peers, 16265 and 13030 to force inbound via 1299
18:25 mark: Undid any routing changes as they were not having the desired effect
18:14 mark: Prepended 43821 twice on outgoing announcements to 16265 to make pmtpa-esams path via nycx less attractive
11:38 Tim: reducing innodb_buffer_pool_size on db19, db21, db22, db29
09:15 Tim: restarting mysqld on db23 again
09:09 Tim: restarting mysqld on db18 again
07:08 Tim: removed db23 from rotation, since I'm bringing it up soon and it will be lagged
07:02 Tim: shutting down mysqld on db18 for further mem usage tweak
06:53 Tim: fixed broken /etc/fstab on db23 via serial console
06:42 Tim: restarting db23
00:08 Tim: repooling db18, has caught up
January 12
21:50 brion: testing a scap after touching MessagesWuu.php to see if that clears borked serialized bits
21:22 RobH: erzurumi installed
21:00 tomaszf: moved erzurumi to vlan 101 on asw-a4-sdtpa
17:55 brion: temporarily stopped apache on srv78, srv118
17:54 brion: srv78 doesn't have upload5 mounted
17:54 brion: srv118 doesn't have upload5 mounted
17:46 RobH: fixed some settings for flaggedrevs in
17:31 RobH: per brion commented out db18 in db.php cuz its making other crap lag too much (bugzilla:16993)
17:26 RobH: updated flaggedrevs.php for
17:23 RobH: updated apache config on yongle for wap => mobile forwarding oversight per
17:05 brion: db18 is backlogged 191k seconds. depooling it; complaints of hella lag
15:32 Tim: restarted mysqld on db18 with reduced memory usage, repooled
14:12 Tim: rebooting db18
13:20 Tim: depooled db18 (is down)
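Note on the lag figures: replication lag in this archive is quoted in raw seconds (for example 191k for db18 at 17:05 above). A minimal sketch for reading those numbers at a glance follows. `format_lag` is a hypothetical helper, and the "k" figures are rounded in the log, so the converted values are approximate too.

```python
def format_lag(seconds):
    """Render a replication lag given in seconds as days/hours/minutes/seconds."""
    days, rem = divmod(int(seconds), 86400)
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    parts = [(days, "d"), (hours, "h"), (minutes, "m"), (secs, "s")]
    # Skip zero-valued units; fall back to "0s" when the lag itself is zero.
    return " ".join(f"{value}{unit}" for value, unit in parts if value) or "0s"


# Lag values quoted in this archive (191k and 13k are rounded in the log).
for host, lag in [("db18", 191_000), ("db12", 13_000), ("db4", 4368), ("db16", 2188)]:
    print(host, format_lag(lag))
```

Read this way, db18's backlog was a little over two days, which is why it was depooled rather than left to catch up under load.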
January 10
16:08 domas: rotated 300g sampled-1000.log ;-)
07:09 river: applied current OS patches to ms2 and rebooted
01:21 Tim: restarted apache on srv95,srv114,srv37,srv49
01:19 Tim: cleaned up disk space on db1. Still looks suspiciously like the master...
00:33 brion: redirecting old bylaws.pdf to wiki page bylaws on wikimediafoundation.org (foundation.conf update)
00:13 brion: reconfigured exim on wikitech to hopefully actually send mail out. whether it reaches anything, we'll see
00:12 tomaszf: turned off fundraising banners
00:08 brion: installed a mail server on wikitech server, hopefully
January 9
22:40 brion: updating missing.php for "missing site" error page (bug 11125)
22:26 tomaszf: updated triggers on db9 for civicrm2 to not show refunds in the public reporting table
21:31 brion: moving wikitech.leuksman.com dns entry to point to my new server in prep for retiring the old one; redirect to wikitech.wikimedia.org should remain intact.
19:54 RobH: fixed timezone setting for wikimedia norge per
19:49 RobH: updated yongle so *.wap redirects to *.mobile.wikimedia.org per
18:36 RobH: disabled anon editing of the wikimania2008 wiki per
18:22 RobH: enabled flaggedrevisions on ptwikisource per
16:19 RobH: enabled flagged revisions on hewikisource per
16:00 RobH: updated redirect for wikimania.wikimedia.org to 2009 site instead of 2008 site per
January 8
22:08 brion: putting db12 back in service, caught up
21:42 RobH: changed the ip address for the management interfaces on sq31-sq50
21:30 RobH: updated dns with the squids and srv management info for pmtpa
21:16 brion: taking load off db12 while it updates
21:15 brion: killing stuck query threads on db12 (lagged 13k seconds)
20:23 RobH: updated dns removing a large number of decommissioned servers from records.
20:08 RobH: pushed updates to dns for management ip allocations, changed management ips of search8-search12
19:42 RobH: changed the management ip addresses of db5-db10 to fit into current ip scheme
18:20 RobH: updated dns for the management name resolution of db11-db30
18:11 RobH: ms5 has lom access enabled and is ready for testing. (Only one ethernet connection in lieu of the typical 3 on the thumper/thors)
15:50 RobH: srv118 reinstalled
15:46 RobH: srv136 is borked. Even after reinstall, it will run for a few minutes, then lock hard. Going to RMA it.
15:38 RobH: reinstalled srv136 and srv118 cuz they were pissing me off (a valid reinstallation reason if there ever was one.)
15:08 RobH: and srv118 back down, thing is borked.
15:06 RobH: srv118 back online and serving requests.
15:01 RobH: pushed db13 back into cluster, same with db14, from yesterday's work
14:26 RobH: srv101 back online and in lvs
14:15 RobH: reinstalled srv101, installing wikimedia-task-app packages now
06:37 JeLuF: rebooted db18. Mysqld was stuck but couldn't be killed.
04:08 Tim: migrated all locked wikis from $wgReadOnly(File) to permissions-based locking, so that stewards can edit the alternate project links, and so that various MediaWiki components don't break on page view
03:57 river: set up ms3/ms4 with solaris 10 update 6
January 7
22:50 RobH: db13 and db14 are replicating but not in the cluster (not sure if they are caught up)
22:35 RobH: updated power strip information for ps1-a1-sdtpa and balanced load
22:35 RobH: reseated mrj cable for csw1-sdtpa_1/13
21:36 RobH: started up db13 and db14
21:19 RobH: updating firmware on db13-db14
21:14 RobH: shutdown db13 and db14 to fix lom lockup issue.
20:52 RobH: depooled db13 and db14 in db.php to reboot them and fix the SP lockup issue.
20:49 RobH: updating firmware on db16.
20:43 RobH: started mysql back up on db15
20:42 RobH: cold reset of db16 to resolve lom issue. will update firmware upon boot.
20:39 RobH: swapped hostnames on ms3 and ms4, updated racktables and dns to reflect change
20:24 brion: disabled wikidiff2 on wikitech since it's not installed, and this apparently is nicely broken
20:21 RobH: db15 now responsive to lom and ready to be re-integrated into the cluster
20:12 RobH: db15 cold reset fixes the LOM non-responsive issue. Upgrading its firmware to prevent future issues.
20:06 brion: removed stray whitespace from wikitech config file which was breaking rss feeds
19:22 mark: Possibility that esams LVS was overloaded, split over 2 boxes (fuchsia & mint)
19:19 RobH: ms3 and ms4 are accessible via LOM and ready for setup/deployment
19:05 RobH: updated dns for ms3-ms5, updated dns for management for all media servers.
19:03 brion: touching MessagesZh.php and re-trying scap; may not have properly updated
17:40 brion-plague: scapping -- merged r45507 zh specialpage alias fix to live. also r45499 (revert of Cite error thingy) seems to already have been merged
13:58 Tim: ran updateAutoPromote.php on all flaggedRevs wikis
13:41 Tim: scap
13:21 Tim: repooled db3 and db4
12:47 Tim: recompressTracked.php complete. Recompressed 628 GB of data to 30GB, a 21x reduction over per-revision compression.
04:36 brion-codereview: svn up'ing testwiki to r45489
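Note on the 12:47 entry above: the quoted 21x figure follows directly from the two sizes in the log. A minimal check of that arithmetic, using only the 628 GB and 30 GB values from the entry (the variable names are illustrative):

```python
before_gb = 628  # size under per-revision compression, from the log entry
after_gb = 30    # size after recompressTracked.php, from the log entry

ratio = before_gb / after_gb
saved_fraction = 1 - after_gb / before_gb

print(f"{ratio:.1f}x smaller, {saved_fraction:.1%} space saved")
# prints "20.9x smaller, 95.2% space saved"
```

The ratio rounds to the 21x stated in the entry.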
January 6
16:01 mark: Changed 'knams' into 'esams' in DNS, kept a lot of old names in place
15:26 Tim: cleaned up binlogs on db1
13:09 mark: Did some Traffic Engineering on the Amsterdam network
11:58 Tim: installed NRPE on new ES servers
11:47 domas: added db29 to s3 duty
11:32 Tim: locked clusters 18 and 19, updated nagios
11:27 Tim: fixed lack of schema on srv161
11:21 Tim: retired cluster18 from the write list, added cluster20 and cluster21
11:15 Tim: cleaned up binlogs on srv105
00:04 tomaszf: built out eiximenis with ubuntu-8.04 for mobile server
January 5
20:47 brion: re-updating SpecialSearch.php and MWSearch.php for better fix of the XSS
20:40 brion: updating SpecialSearch.php for XSS issue
20:00 RobH: wikitech is moved to new host. Still needs HTTPS setup. Redirects from old host are in place.
13:17 domas: setting up db24-db26 LVMs per
12:56 mark: Brought down BGP transit session to AS 1145 / Kennisnet
12:29 domas: db16 had our special deadlock, didn't come up after reboot, SP not responding, needs datacenter activity
12:07 domas: upgraded BIOS firmware on db29,db30 and accidentally on db19 (damn .29 ip :)
11:47 domas: added 208.80.152.185 to noc.wikimedia.org vhost ServerAlias
10:33 mark: Brought BGP session to AS 16265 back up
00:04 Tim: cleaned up binlogs on ixia and db1
January 4
17:08 mark: Restored traffic to esams
16:38 mark: Moved route sourcing from br1-knams to csw1-esams
15:55 mark: Moving esams traffic to pmtpa (scenario knams-down)
January 3
23:57 mark: Restored AAAA record on upload.wikimedia.org
12:04 domas: db17, db18 had OS/firmware updates, rebooted
10:50 domas: db19 RAID complaining about temperature, check-raid/kswapd/mysqld deadlock. upgrading RAID firmware, rebooting, etc
01:23 Tim: removed db3 and db4 from rotation again, to allow recompressTracked to go faster
00:36 Tim: depooled db19, is down
00:32 Tim: restarting recompressTracked with an extra wfWaitForSlaves()
00:08 Tim: repooled db3 and db4
January 2
22:35 Tim: depooled db3 and db4 temporarily
21:56 Tim: killed recompressTracked for now, not waiting for slaves properly. db3 and db4 lagged.
20:54 mark: Set db4 s1 load to 0, 4368s lagged
00:42 Tim: restarting recompressTracked.php on hume
January 1
20:34 brion: live-merging file delete fatal error fix from r45278
19:47 brion: bumped meter image to 7
01:59 brion: scapping!
01:39 brion: svn up'ing test.wiki to r45274
00:55 brion: svn up'ing on test.wikipedia
December 31
18:40 brion: fixed old whygive.wikimedia.org blog by copying de-conflicted WordPress source files out of the active blog where we fixed it after the 2.7 upgrade
December 30
23:02 RobH: is leaving on a jet plane, weeeeeeeee.. in 8 hours.
23:01 RobH: all knams squids are now online.
22:49 RobH: knsq23-26 back in rotation, 3 more to go.
22:33 RobH: enabled knsq16-knsq22 in lvs, almost time to go back to hotel and die.
22:22 brion: attempting to purge affected pages on dawiktionary, dawiki
22:21 brion: taking dawiki, dawiktionary out of read-only because the rest of the fixes won't work until it's disabled :P
22:14 brion: poking diff version in live DifferenceEngine.php to eliminate bogus cache entries for dawiki/dawiktionary
22:11 RobH: stopping and clearing the cache on knsq16-knsq30.
22:06 brion: trying it again, but this time with the right variable names
22:02 brion: attempting to clear revision text loading cache entries for dawiktionary, dawiki
21:47 brion: live-merging r45206 so bugzilla:16841 corrupted entries will be loaded properly on dawiki/dawiktionary. need to clear revision, diff, parser caches...
21:15 brion: locking dawiki, dawiktionary ($wgReadOnly) pending encoding fix
20:07 brion: killed recompressTracked.php processes on hume pending investigation of encoding breakage
20:02 brion: commenting ariel out of pmtpa also
19:58 brion: trying to clear no-longer-in-dns hosts from ALL node group
19:57 brion: PLEASE SAY WHAT SERVER YOU'RE RUNNING BATCH PROCESSES ON IF THEY'RE NOT ON ZWINGER. thanks
19:56 RobH: power disconnection for primary routing rack in esams. power restored, and totally was not robh's fault regardless of what lies mark may say to the contrary.
19:54 brion: encoding issues reported with some old edits on dawiki. wondering if this is recompression-related?
18:46 brion: added PMTPA nameserver back in mayflower's resolv.conf so DNS actually works on it until things are fixed
17:42 brion: internal DNS for knams seems to be down (at least on mayflower), this is breaking at least SVN update notifications
17:14 brion: updating logo for pmswiki bugzilla:16587
13:29 Tim: starting recompressTracked.php on all wikis
11:22 mark: Shutting down knsq16-30
10:59 mark: In case of overload problems, please move traffic to pmtpa (scenario knams-down)
10:54 mark: Depooled knsq16-30
10:47 mark: Set DNS timeout on fuchsia (LVS) to 1s, PyBal timeout to 8s
10:21 mark: Unracking pascal, mint, lily
09:57 Tim: testing recompressTracked on huwiki
09:38 mark: ts-array3/A --> yarrow/0
09:23 TimStarling: testing recompressTracked on testwiki
09:20 mark: hemlock/eth1 <--> clematis/eth1
09:17 mark: ts-array2 -> zedler scsi B, ts-array1/0 -> zedler scsi A
08:47 Tim: running FlaggedRevs/maintenance/clearCachedText.php on all FlaggedRevs wikis
December 29
11:24 mark: Shutting down and unracking mayflower (subversion)
11:21 mark: Temporarily disabled AAAA record upload.wikimedia.org for ipv6 participants
11:19 mark: Unracked fuchsia
11:16 mark: In case of overload problems, move traffic to pmtpa!
11:11 mark: Moving all LVS to mint
09:56 mark: Depooled knsq8-15
09:56 mark: Unracked knsq1-7
09:43 mark: Repooled knsq23-30, depooled knsq1-7
09:23 mark: Depooled knsq23-30
08:47 Tim: deleted some binlogs on srv108.
04:50-05:32 Tim: set up external storage on the remaining 9 servers in srv151-186: srv160, srv161, srv162, srv172, srv173, srv174, srv184, srv185, srv186
03:41 Tim: running orphanStats.php on all wikis
03:26 Tim: restarted apache on srv33, srv146, srv169, srv172
03:00 Tim: cleaned up binlogs on srv105
December 28
21:33 brion: tweaked namespace robot policies for hewiki bugzilla:16247
20:52 brion: tweaking it correctly this time
20:50 brion: tweaking centralnotice loader path for secure.wm.o
20:20ish brion: copied a couple image files for Bugzilla skin to local dir, since Firefox 3.1b whinges about loading images via http: from an https: page
18:21 brion: we've been getting reports of difficulties reaching PMTPA via Level3
18:03 brion: updating thwiki logo bugzilla:16008
17:54 mark: csw1-esams racked and configured; link established with br1-knams
12:14 mark: Moving equipment to EvoSwitch
11:55 mark: Moved udpmcast from pascal to lily
11:48 mark: sage stays at knams, to be racked into J-13 later
11:44 mark: Unracking ragweed
11:38 mark: Unracking hawthorn
11:37 mark: Unracking sage
11:37 mark: Unracked csw1-knams
11:25 mark: Directed traffic back to knams
10:52 mark: knams network should be back up
09:05 mark: Moving knams traffic to pmtpa
December 27
21:50 brion: removed stale sitemaps dirs for several private wikis
December 26
00:50 Tim: started mysqld on db19, repooled
00:44 Tim: got connection on db19 and assumed it was still broken, initiated shutdown
00:44 domas: db19 had jfs/kswapd/etc deadlock, came up after reboot
00:34 Tim: noticed db19 was down, depooled it.
December 25
23:59 domas: restarted db19 with sysrq without telling anyone
19:37 brion: adjusted subpage namespaces for arbcom_enwiki
19:11 brion: disabled magic_quotes_gpc on yongle -- mobile.wikimedia.org gateway doesn't compensate for quoted input. :P
19:09 brion: merry christmas!
01:09 brion: re-running SVN metadata import for CodeReview to fix comment encoding (bugzilla:16640)
December 24
21:55 brion: merging r45005 (restoring default font for Safari textarea)
December 23
23:35 brion: svn up'd to r44990 (serialization updates broken by Setup.php change)
23:28 brion: starting scap!
23:24 brion: svn up'ing to r44989, prep for scap!
22:41 brion: think i tweaked scap script to update skin files on upload.wikimedia.org ...hopefully :)
22:09 brion-codereview: svn up'ing test.wikipedia.org to r44982 -- DO NOT SCAP UNTIL TESTED!
02:38 Tim: cleaned up binlogs on db1, db2. Removed cluster19 from the write list, it's almost full.
02:28 brion: clearing out bogus page_restrictions entries (bugzilla:16629)
December 22
22:56 brion: updated timezone for huwikinews (bugzilla:14343)
December 21
03:05 Tim: depooled db4 temporarily to speed up a long running trackBlobs query
December 20
01:08 brion: starting a cleanupImages run on all wikis
00:57 brion: set UI lang for mainpage on meta bugzilla:16701
December 19
23:52 brion: removing MessageCache::get profiling hack, all done
22:16 brion: adding profiling hack for MessageCache::get
13:48 mark: Found knsq12 turned off, brought it back up
12:17 mark: Unracking knsq15 to make room for the new router
08:53 Tim: changed crontab on hume to run rebuildTemplates.php every 30 minutes instead of every 10 minutes, since it's taking about 30 minutes to finish each run
07:42 Tim: started trackBlobs.php running on hume, for all wikis
December 18
23:16 brion: updating MessagesLij.php, MessagesMt.php -- namespace breakage
21:53 brion: bugzilla:16597 spam regex update
21:01 RobH: added wikitech subdomain for future setup/migration of wikitech mediawiki
20:33 RobH: added commons to meta imports allowed per
14:50 RobH: pushed dns change to correct spence.mgmt.pmtpa.wmnet.
03:09 TimStarling: killed long-running query on db9, 5762 seconds, plain select query probably with a read lock held by the thread, all read queries were waiting for the lock
02:27 TimStarling: deleted binlogs on srv105 and srv108
01:16 brion: briefly experimented with changing wgLogo on testwiki via Configure and it didn't explode. yay! setting it back to default and just letting it be. only stewards can edit config, and only wgLogo is configable atm.
01:12 brion: testing Configure on testwiki only
01:10 brion: created test Configure ext tables in 'wikiconfig' db
00:49 brion: scapping for update of Configure extension prior to small-scale test deployment
00:48 Danny_B: wikibugs-l stopped sending mails to wikibugs-irc mailbox due to excessive bounces. reenabling sending again
00:28 RobH: fixed part of the revert for lucene that i missed.
00:24 RobH: reverted lucene.php changes from rainman's testing.
December 17
23:18 RobH: more lucene changes
22:36 brion: applied fix for Android browser on mobile gateway (also did the pl language setup recently)
22:05 RobH: more lucene.php changes
21:12 RobH: additions to lucene.php per rainman
20:39 mark: Corrected LVS service IPs on search2, search10-12
20:03 brion: hacked mw-serve init script on yongle into shape. will commit it in a bit and update docs
19:38 brion: pdf server seems to have eaten all temp space on yongle. clearing...
19:26 mark: Set up search2, search8-12
18:57 RobH: pushing dns changes for new misc. servers management resolution
18:30 RobH: updated lucene.php with rainman to do things that I really do not get but he knows about.
16:28 RobH: new servers auth1, nfs2, streber and williams are racked, IP's allocated, DRAC working. No DHCP entries or OS installed yet.
16:08 mark: restarted lighttpd on zwinger
15:59 RobH: added williams to dns records, updated dns
15:50 TimStarling: removed some binlogs on ixia
01:17 brion: scapping a couple more fixes to r44698
00:36 brion-codereview: srv126 is borked -- read-only filesystem
00:23 brion-codereview: scapping to 44696
00:15 brion-codereview: svn up'ing on test...
December 16
23:09 brion-codereview: disabling FixedImage extension -- was used for old 2006 and 2007 fundraisers; images no longer exist and are not applicable to current fundraisers
20:34 RobH: ariel is dead, will decommission later.
20:29 RobH: ariel is fubar, rebooting and investigating.
20:25 RobH: restarted services on sq13
20:21 RobH: took down sq13 to clean its cache
20:09 RobH: replaced bad /c0/p0 in amane
19:45 RobH: setup drac access for nfs1, brewster, auth2, dobson, eiximenis, erzurumi, fenari, grosley, loudon, singer, & spence. The other 3 misc. servers will be setup later. OS not installed, just remote access setup and IP space allocated. (Not setup in DHCP yet.)
18:47 brion: applying temporary resource limit lift on enwiki for an IP for workshop in SF
17:40 RobH: updated dns for misc. servers project.
01:08 brion: deploying r44643 update to CodeReview subversion proxy (swapped encoding protocol to avoid bugs in json_decode with some diffs)
00:04 brion: running cleanupTitles.php in bg on all wikis...
December 15
23:20 brion: going to test fixes for FiveUpgrade.inc to back cleanupTitles.php, cleanupImages.php etc
22:21 RobH: changed settings on metawiki to allow banned users to edit their talk pages per
21:25 brion: reenabling handheld skin setting, was turned off during overload emergencies on 11-17
21:13 brion: rsyncd appears to be running on srv56. does anything else need to be done for index updates?
20:10 brion: yongle hanging again, restarting apache
18:58 RobH: started rsync daemon on srv56 per rainman
18:35 RobH: setup new planet per
01:39 brion-weekend: applying API deletion log fix from r44541 (bugzilla:16626)
00:09 rainman-sr: rsyncd is not running on srv56, updates for wikis served by old indexer halted since Oct7. Run rsync --daemon on srv56
December 14
02:04 Platonides: Connections timing out
December 13
02:04 brion: applied patch-rfb_ratings.sql to flaggedrevs wikis
01:46 brion: did some debugging on RatingHistory graph generation with Aaron and got it working yay!
December 12
22:47 brion: patched Bugzilla so we can exclude CC-only mails from wikibugs-l (bugzilla:15585)
21:52 brion: scapping to r44509
19:19 brion: put all the themes and plugins and patches back on wordpress for blog.wm.o. whee
19:15 brion: restarted apache on isidore while fiddling with php error logging settings and blog started magically working again. sigh. going back to tweak its config back to normal
18:04 brion: we managed to fix the svn update conflict on blog.wm.o (to wordpress 2.7) but it's still showing main page as blank
17:42 mark: Telia connection / BGP session was up for 20 hours; problem seems resolved. Removed route filters
00:29 brion: bumping to r44485 for more NS fixes for ms, ast
00:12 brion: scapping bump to r44484, fixing a few issues w/ hu
00:06 brion: updated wikibugs irc script to r44483, fixes issues w/ users w/o real name setting
December 11
23:19 brion: shutting down srv118; bad config. missing upload5 mount, seems to have bogus authentication (local su to root fails with "Authentication service cannot retrieve authentication info")
23:10 brion: restarted apache on 134, it's scary/corrupt
22:55 brion: manually syncing updated skin files to upload.wm.o ...
22:53 brion: scapping to r44474
21:31 brion: don't sync yet; RC regression in r44033 being worked on
19:41 brion-codereview: removed conflicting live profiling hack from AutoLoader.php. Put this stuff in SVN, huh guys?
19:39 brion-codereview: applying flaggedrevs schema updates
19:38 brion-codereview: starting svn up for testwiki
13:41 mark: configured asw-a4-sdtpa and asw-a5-sdtpa, but no link
10:41 mark: bart out of disk space, removed some old cruft (mailman)
December 10
23:50 RobH: pulled srv76 due to two dead fans (yay for da bot)
23:35 RobH: srv78 reinstalled and in apache pool
22:57 RobH: srv78 kernel panic, old FC install, pulled for reinstall
22:49 RobH: sq1, sq3, sq6 cache cleaned and back online serving requests.
22:35 RobH: sq1, sq3, sq6 all unresponsive to console, flashing leds on kvm. rebooted.
20:40 RobH: srv118 installation completed.
20:00 RobH: reinstalled srv118 after replacing dead parts. installing packages now.
19:48 RobH: started rebuild of storage1 /c1/p0 into array
19:47 RobH: replaced disk /c1/p0 in storage1. /c1/p13 is now bad as well, placing rma for it.
19:14 RobH: db13-db16 responsive to ssh.
19:13 RobH: db15 rebooted.
18:05 RobH: temp probes installed in a3-sdtpa
December 9
18:46 RobH: fixed group names in add/remove groups per
18:42 RobH: updated some settings for no.wikimedia.org and pushed to cluster.
15:23 RobH: backed up blog frontend/database and upgraded to 2.6.5 successfully
14:21 RobH: updated InitialiseSettings for nowikimedia wiki
06:47 Tim: srv146 did not have /mnt/upload5 mounted. Fixed.
02:03 brion: dropped loading of obsolete RenderHash ext (bug 16114)
December 8
23:30 RobH: updated enwiktionary group settings per
23:24 brion: updating Oversight for bug 16065
22:44 RobH: no.wikimedia.org is now functioning per
22:35 RobH: made changes to InitialiseSettings.php for cswikisource per
21:37 RobH: authdns-update for no.wikimedia.org
21:20 RobH: running sync-common-all for wikimedia norge (found the php error)
21:01 RobH: its all back up now.
20:59 RobH: I stupidly crashed the site with a php typo, rolling back my changes since i was ignorant and did not php -l ;_;
20:58 RobH: setup wikimedia norge wiki per
19:23 brion: updating OggHandler for fix for bug 15920 (chopped oggs)
15:57 mark: Set up mirroring of traffic of e7/2 to e7/14 for testing the fiber patch loop/optics
13:16 Tim: added some IWF proxies to the trusted XFF list. These proxies are probably about 30% of the IWF traffic, the other 70% comes from proxies that pass through the XFF header without adding the client address.
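The 13:16 entry turns on how a trusted XFF list is used: the connecting address is replaced by the last X-Forwarded-For hop only when the connecting address is a known proxy. A minimal sketch of that idea follows, assuming a simple right-to-left walk; the function names and the sample addresses are hypothetical and this is not the production MediaWiki code.

```python
import ipaddress


def _is_trusted(addr, trusted_set):
    """Return True if addr parses as an IP address and is in the trusted set."""
    try:
        return ipaddress.ip_address(addr) in trusted_set
    except ValueError:
        # Garbage such as "unknown" in the header is never trusted.
        return False


def client_ip(remote_addr, xff_header, trusted):
    """Pick the address a request should be attributed to.

    remote_addr: address of the TCP peer.
    xff_header:  raw X-Forwarded-For value (may be empty).
    trusted:     iterable of proxy addresses whose XFF entries we believe.
    """
    trusted_set = {ipaddress.ip_address(a) for a in trusted}
    hops = [h.strip() for h in xff_header.split(",") if h.strip()]
    current = remote_addr
    # Walk the header from the right while the current hop is a trusted proxy.
    while hops and _is_trusted(current, trusted_set):
        current = hops.pop()
    return current
```

A proxy that passes the header through without appending the client address (the 70% case in the entry) leaves nothing to walk, so the proxy's own address is what gets attributed: `client_ip("192.0.2.10", "", ["192.0.2.10"])` returns `"192.0.2.10"`.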
December 5
22:42 domas: srv47 is running scaler usr.sbin.apache2 aa profile in learning mode
22:33 RobH: sq50 reinstalled and back in rotation
22:25 RobH: finished setup on srv146, back in apache pool
21:32 RobH: setting up packages on srv146
21:32 RobH: reinstalling sq50
21:27 brion: pointing SiteMatrix at local copy, not NFS master, of langlist file
19:19 RobH: added sq48, and sq49 back into pool. sq50 pending reinstallation.
18:58 mark: depooled broken squids sq1 and sq3
18:26 RobH: depooled sq48-sq50 for relocation
18:17 RobH: added sq44-sq47 back into pybal, relocation complete.
17:45 brion: sync-common-all to add w/test-headers.php
17:28 RobH: shutting down sq44-sq47 for relocation.
17:27 RobH: sq41 - sq43 back online.
17:17 RobH: sq40 oddness, but its back up now
16:44 RobH: accidentally pulled power for sq38, oops!
15:36 RobH: removed sq41 - sq43 from pybal to relocate from pmtpa to sdtpa
15:34 domas: srv178 running usr.sbin.apache2 aa profile in complain mode
15:34 RobH: removed sq40 from pybal to relocate from pmtpa to sdtpa
December 4
22:50 domas: job runners are no longer blue on ganglia CPU graphs :(((((((
22:45 domas: fc4 maintenance, reniced job runners to 20 (10 behind apaches), installed apc3.0.19 (APC3.0.13 seems to have hit severe lock contention/busylooping at overloads)
22:04 RobH: re-enabled sq38 in pybal. all is well
22:02 RobH: fired sq37-sq39 back up
21:58 RobH: shutdown sq37-sq39, cuz I need to balance the power distribution a bit better.
21:40 RobH: sq38 is trying to break my spirit, so i reinstalled it to show it who is boss (me!)
21:02 RobH: setup asw-a4-sdtpa and asw-a5-sdtpa on scs-a1-sdtpa
20:52 mark: Increased TCP buffers on srv88 (a Fedora), matching the Ubuntus - Fedora Apaches appear to get stuck/deadlocked on writes to Squids
19:39 RobH: pulled sq38 back out, as it is giving me issues. need to fix the msw-a3-sdtpa before i can fix sq38.
19:35 RobH: added sq38, sq39 back into pybal
19:25 RobH: added sq36, sq37 back into pybal
18:14 RobH: I need to stop forgetting about lunch and stop working through it, oh well.
18:13 RobH: depooled sq36-sq39 for move from pmtpa to sdtpa.
18:12 RobH: some tinkering with lvs4 and idleconnection timer was fixed by mark.
17:46 RobH: racked sq21-sq35 in sdtpa-a3. added back to pybal.
16:31 RobH: depooled sq31-sq35 from lvs4 to move from pmtpa to sdtpa
15:15 RobH: reinstalled storage1 to ubuntu 8.04, left data partition intact and untouched.
December 3
23:46 JeLuF: performing importImage.php imports to commons for Duesentrieb
19:13 RobH: tested i/o on db17, issue where it pauses disk access is gone.
19:02 mark: Shutdown TeliaSonera (AS1299) BGP session, the link is flaky resulting in unidirectional traffic only for most of the day
19:02 RobH: replaced hardware in db17, reinstalled.
18:58 mark: Prepared search10, search11 and search12 as search servers
17:26 brion: investigating ploticus config breakage (bugzilla:16085)
17:18 brion: ploticus seems to be missing from most new apaches
17:12 RobH_DC: search10, search11, search12 racked and installed.
14:29 RobH_DC: srv136 was unresponsive, rebooted, synced, back in rotation.
December 2
23:57 Tim: added CNAME poke.wikimedia.org for SMS notification project
23:33 brion: scapping to update ContributionReporting ext
23:11 Tim: db7 wasn't deleting its relay logs for some reason, since August 21. Disk critical. Did a reset slave.
20:03 brion: rebuilt public_reporting with fixed encoding
19:53 brion: fudged charsets in triggers for donation db update, let's see if that helps
12:11 Tim: started squid (backend instance) on sq40, stopped for 13 days for no apparent reason
12:08 Tim: restarted apache on srv161, srv122, srv137, attempted on srv123 but it is waiting for dead NFS mount
11:48: srv183 made a miraculous recovery
11:44 Tim: took srv183 out of memcached rotation
11:10-11:35: a spike in backend requests (as seen in lvs3 network) caused the application cluster to overload. Due to the extra threads, srv183 went into swap and died.
10:50 Tim: purged binlogs on ixia and db1 (both critical)
December 1
23:49 brion: sync-common-all'ing to add a wikispecies little icon for sul shared session login, since people keep asking for it :)
20:31 RobH: synced and restarted apache on srv89
19:33 RobH: manually setup apache-check for pybal on srv138, synced, enabled.
19:29 RobH: manually setup the apache_check stuff for srv126 and pybal.
17:19 RobH: synced and restarted apache on srv176 & srv176
17:18 RobH: did the sync and restart thing for apache on srv162
17:16 RobH: synced and restarted apache on srv145
17:13 RobH: synced and restarted apache on srv121 and srv125
17:00 RobH: apache wasnt working on srv102 and srv106, restarted them after syncing
15:10 mark: Restarted stuck pdns_server on bayle, lots of stale selective_answer.py processes
14:44 domas: restored Roma article on itwiki, had orphaned revision entries after deleting it, manually inserted page entry
14:40 mark: Setup Telia transit at knams, but all inbound routes filtered
14:35 RobH: removed images from plwiki flaggedrevs per request from Leinad
November 30
12:14 mark: restarted flapping apache on srv119, looks like memory corruption going on
November 28
18:58 brion-holiday: updating User-Agent blacklists to block 'WebCapture' download tool but not the Library of Congress's www.loc.gov/webcapture/ spider
18:17 yksinaisyyteni: fixed broken upload/deletion/timeline on jawiki
07:11 JeLuF: succeeded to umount /mnt
07:10 JeLuF: killed hanging cron entries on db22. updatedb.mlocate. Might be related to broken mount db16:/a -> /mnt
07:05 JeLuF: killed lots of jobs running on db22, "SELECT /* ApiQueryBacklinks::run XX.XXX.XXX.X */ page_id,page_title,page_namespace,page_is_redirect" which were in status "copying to tmp table"
November 27
13:10 mark: hungover, headache, lack of voice
November 26
17:00 RobH: fixed flaggedrevs to work on ruwikiquote, due to my own mistake in earlier implementation, per
02:38 brion: updated Math.php to r43966 which both fixes 0-byte math PNGs and generates correct URLs *cough*
02:36 brion: broke math temporarily woops
02:29 brion: bumped Math.php to r43965 to hopefully clear out those 0-byte math images (bugzilla:16440)
02:01 brion: updating CentralNotice to r43962 to fix sitenames again :P
01:57 brion: poking centralNotice to r43961 for evil hacks to bump limits temporarily :D
01:31 brion: updating CentralNotice to r43959
November 25
19:25 brion: syncing update to CentralNotice
18:28 RobH: root password changed across all servers. if you didnt get a copy and you should have one, talk to another tech team member.
17:58 RobH: added bayes to allowed nfs connections to storage2, setup fstab for nfs mounts on bayes, revoked shell access for ezachte on storage2 (not needed for what he wanted)
15:49 RobH: updated some points for huwiki flaggedrevs and removed an outdated user group per
15:38 RobH: gave erik zachte login rights to storage2
15:16 RobH: updated dns for survey software
01:35 brion: updating ContributionReporting ext
01:06 brion: forcing a manual run of centralnotice batch update on hume
01:04 brion: restarting memcached on srv64
01:02 brion: memcache bad on srv64
01:01 brion: notice texts borked on at least wikimedia, wiktionary
November 24
22:45 brion: updated ContributionReporting for some silly bugs
22:20 RobH: portal and portal_talk namespaces added to dvwiki per
22:04 RobH: added two new namespaces to dewikinews per
21:29 RobH: removed a group and granted further permission customization for huwiki per
21:09 RobH: pushed a bad flaggedrevs.php that rendered blank pages for all wikis with flaggedrevs enabled. fixed it, it's working properly now, oops ;]
21:06 RobH: appended page and dossier namespaces into the frwikinews flagged revisions per
20:36 RobH: enabled flaggedrevs on ukwiktionary per , and ran sync-common-all
20:27 RobH: ran sync-common-all
20:27 RobH: enabled flaggedrevs on dewiktionary
20:07 mark: moved upload knams LVS to mint
20:05 brion: mark is on the case -- LVS overload
19:58 brion: seem to be getting heavy packet loss on some routes to knams
19:47 RobH: changed nameservers for wikimedia.li to WMCH administered name servers.
19:30 RobH: re-enabled arzwiki, cannot find the bugzilla entry.
15:43 RobH: search2 reinstalled and ready for search setup and deployment
November 22
18:28 yksinaisyyteni: srv108 (cluster19) disk full, removing old logs
00:37 brion: bumped php.ini post/file upload limit to 100mb, we'll see how well uploads to that size actually work  :)
November 21
23:11 brion: dropping 'Wikipedia: a non-profit project' banner from rotation, as it's apparently not a winner
22:56 brion: updated logo for cr.wikipedia (bugzilla:16417)
18:34 brion: running updateAutoPromote on new flaggedrevs wikis (bugzilla:16415)
November 20
01:00 brion: updating ContributionHistory
00:34 brion: moving $wgStyleSheetPath back to upload.wikimedia.org
November 19
22:47 brion: updating Tomas skin to r43752 for toc fix
22:41 brion: scapping for ContributionReporting update to 43750 (localization bugs)
22:40 brion: ran namespaceDupes --prefix=D on enwiki and dewiki -- some 'D:blah' pages conflicted with iw prefix 'd' for wiktionary
15:53 brion: updated centralnotice templates with user-targeted lightweight collapsed notice (wish it was for everybody)
01:38 brion: updating CentralNotice to r43697 for anon/user collapsed variants
00:35 yksinaisyyteni: unmounted storage1:/export/upload on all hosts
00:32 yksinaisyyteni: rebooted srv{114,184,166} to fix stuck nfs mount
November 18
23:52 brion: enabling new search UI on testwiki
21:35 brion: switching css/js back to text temporarily to reduce load on upload squids
21:27 brion: request -- squid conf deploy script should do a config file dry-run before actually deploying
21:26 brion: there's load on ms1...
21:25 brion: started more... most... all? squids in squids_upload
21:24 brion: restarted squid manually on 46
21:17 brion: uploads still borked, we're investigating the squid config problem
21:16 brion: rebuilding squid conf, was a little funky
21:12 brion: updating squid config to send centralnotice to ms1 instead of storage1
20:41 RobH: db24 reinstalled, awaiting domas to do the magic db stuff
20:38 RobH: replaced disk /c0/p7 in amane and started rebuild
20:34 RobH: replaced controller in search2, search2 requires reinstall
20:34 RobH: replaced controller in db24, db24 reinstalling.
20:03 mark: installed gmond on db9 and db10
19:59 brion: scapping to update Collection for regression fix
01:51 mark: Moved text LVS to temporary LVS host lvs4, with an optimized kernel
01:48 brion: setting $wgStyleSheetPath to point at upload.wikimedia.org/skins for non-SSL hosts
01:30 brion: disabling handheld stylesheet; one less thing to load, should have little impact
01:15 brion: another crappy slow squid this time in pmtpa
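The 21:27 request above (dry-run the squid config before deploying it) could be sketched roughly as follows. This assumes `squid -k parse -f <file>` is available on the deploying host to parse a config without starting the proxy; the `deploy` wrapper and its `install` callback are hypothetical and not the actual deploy script.

```python
import subprocess


def squid_config_ok(conf_path, squid_bin="squid"):
    """Ask squid to parse conf_path and exit; True if it reports success."""
    result = subprocess.run(
        [squid_bin, "-k", "parse", "-f", conf_path],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


def deploy(conf_path, install, check=squid_config_ok):
    """Install a generated config only if the dry-run check passes.

    install: callable that copies conf_path to the target hosts (hypothetical).
    check:   callable returning True when the config parses cleanly.
    """
    if not check(conf_path):
        return False  # leave the running config untouched
    install(conf_path)
    return True
```

Keeping the check injectable makes it possible to exercise the gating logic without a squid binary present.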
November 17
22:27 brion: mw-serve having ups and downs as we test the new init script
22:12 brion: started mw-serve on bindery; probably didn't get restarted after yongle crashed. pinging RobH to set up a boot script for it :)
21:16 brion: fixed deferred entries in CodeReview -- schema updater to add 'reverted' accidentally removed 'deferred' :)
20:51 brion: scapping update to r43634
20:43 RobH: re-ran sync-common-all cuz vibber's svn of things messed with the first run. too many ppl, not enough real estate ;]
20:40 RobH: flaggedrevs enabled on ruwikiquote and ruwikisource per
20:39 brion: prepping scap
20:16 brion: updating codereview schema for 'reverted' status
20:11 mark: restarted knsq28 frontend to fix out of socket mem errors
20:00 RobH: FlaggedRevs is now active on french wikinews per
19:59 brion: added hsb, oc subtitles for fundraising video
18:12 RobH: enabled flaggedrevs on huwiki per (with a long list of custom settings.)
18:01 RobH: FlaggedRevs enabled on alswiki per
17:54 RobH: FlaggedRevs enabled on zh_classicalwiki per
17:29 RobH: resynced apaches after touching InitialiseSettings.php to make flaggedrevs active on plwiki
15:30 RobH: enabled flaggedrevs on plwiki per
15:21 RobH: sync-common-all because I neglected to check my own work in flaggedrevs.php
15:18 RobH: fixed flaggedrevs.php typo.
15:07 RobH: set $wgFlaggedRevsTabs = true; on dewiki per
14:44 RobH: enabled flaggedrevs on eowiki
14:21 RobH: flaggedrevs enabled on pt.wikinews.org
November 16
17:24 brion: notices are becoming unborked with new regen. should be done and recached within 10 minutes
17:17 brion: srv120 memcached now functional according to test: 10.0.2.120:11000 set: 100 incr: 100 get: 100 time: 0.0809991359711
17:16 brion: restarting memcached on srv120
17:14 brion: srv120's memcached seems broken: 10.0.2.120:11000 set: 100 incr: 0 get: 0 time: 0.0769970417023
17:05 brion: investigating centralnotice borkage on non-wikipedia sites
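The srv120 test output above (set: 100 incr: 0 get: 0 when broken, 100/100/100 when healthy) suggests a probe that counts successful set, incr and get round trips. A minimal sketch of such a probe follows; the `DictClient` stand-in and the `probe` function are hypothetical illustrations, not the script that produced the logged output.

```python
import time


class DictClient:
    """In-memory stand-in for a memcached client (hypothetical, for illustration)."""

    def __init__(self, broken=False):
        self.store = {}
        self.broken = broken  # simulate a daemon that accepts writes but loses them

    def set(self, key, value):
        self.store[key] = value
        return True

    def incr(self, key):
        if self.broken or key not in self.store:
            return None
        self.store[key] += 1
        return self.store[key]

    def get(self, key):
        if self.broken:
            return None
        return self.store.get(key)


def probe(client, rounds=100):
    """Count how many set/incr/get operations return the expected result."""
    counts = {"set": 0, "incr": 0, "get": 0}
    start = time.time()
    for i in range(rounds):
        key = f"probe:{i}"
        if client.set(key, i):
            counts["set"] += 1
        if client.incr(key) == i + 1:
            counts["incr"] += 1
        if client.get(key) == i + 1:
            counts["get"] += 1
    counts["time"] = time.time() - start
    return counts
```

A real check would pass in a client connected to the host and port under test; only the `set`, `incr` and `get` methods are assumed here.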
November 15
01:03 brion: scapping to r43514 -- regression in CodeReview :)
00:49 brion: enabled UDP->IRC logging for CentralAuth user creations, now that it works instead of crashing PHP
00:45 brion: set up ariel on isidore for blog maint
00:24 brion: starting scap from r42593 to r43512
00:02 brion: preparing for general svn up && scap
November 14
23:24 RobH: updated flaggedrevs: $wgFlaggedRevValues to 4 from 2 for enwikibooks, synced files out to cluster.
23:11 RobH: FlaggedRevs deployed on enwikibooks.
23:00 RobH: removed the crap for specific seroul servers in sync-common-all
22:43 brion: tweaked flaggedrevs.php to have cleaner default behavior
20:27 RobH: setup the backend stuff for arz wiki but not enabled yet.
19:59 brion: yongle is back up! yay
19:48 RobH: fixed authdns-update script, was not rsyncing over the langlist file
19:47 brion: swapping codereview-proxy to isidore since yongle's still down
18:01 brion: requesting reboot on yongle from PM support
17:14 domas: yongle is hanging, apple dictionary searches staled
16:12 RobH: upgraded installation of blog.wikimedia.org and whygive.wikimedia.org to newest stable versions.
15:14 RobH: limesurvey.wikimedia.org online on isidore, initial users created and deployed.
02:03 brion: pascal down again
00:00 brion: syncing to update InputBox extension (note: renamed from inputbox)
November 13
23:41 brion: scapping to update CodeReview
20:26 brion: scapping updates to Collection and ContributionReporting exts
17:33 brion: set up TrevorParscal with access to reporting database so he can grab updates to test with
17:03 river: upgraded ms1 to solaris 10 update 6 + rebooted
09:57 Tim: db10 sync worked just fine this time, it's now replicating all DBs
08:27 Tim: db10 slave start potentially botched, going to re-read the dump and try again
06:43 Tim: loading data into mysqld on db10
06:35 Tim: copy finished, restored r/w on bugzilla
05:43 Tim: copying data from db9 to db10 using: mysqldump -h db9 --master-data --single-transaction --all-databases | gzip --fast > db9-master-data-2008-11-13.sql.gz
05:34 Tim: switching bugzilla into read-only mode for copy to db10. Queries will be denied by user permissions for all tables except logincookies.
05:02 Tim: converting all tables in bugzilla to InnoDB except longdescs
04:53 Tim: converting the MyISAM tables in otrs to InnoDB (the large ones are done already)
04:49 Tim: converted donateblog and newsblog to innodb
03:34 Tim: converted racktables DB to InnoDB
01:59 atglenn: changed wireless network password
01:43 Tim: doing lockless backup of db9 to db10. This will give us a fallback in case disaster strikes during the considerably more complex replication synchronised dump which will follow.
00:45 brion: poked it again
00:29 brion: updating for ContributionReporting
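The 05:43 copy pipes mysqldump into gzip, and a shell pipeline by default reports only the exit status of its last command. A minimal sketch that runs the same two commands and checks both exit statuses is below; the flags are the ones quoted in the entry, while the function names and the output path argument are hypothetical.

```python
import subprocess


def dump_command(source_host):
    """Build the mysqldump argv used in the 05:43 entry for a given host."""
    return [
        "mysqldump",
        "-h", source_host,
        "--master-data",
        "--single-transaction",
        "--all-databases",
    ]


def dump_to_gzip(source_host, out_path):
    """Stream a dump through gzip --fast into out_path.

    Returns True only if both mysqldump and gzip exit with status 0.
    """
    with open(out_path, "wb") as out:
        dump = subprocess.Popen(dump_command(source_host), stdout=subprocess.PIPE)
        gz = subprocess.Popen(["gzip", "--fast"], stdin=dump.stdout, stdout=out)
        # Close our copy of the pipe so mysqldump sees SIGPIPE if gzip dies.
        dump.stdout.close()
        gz_rc = gz.wait()
        dump_rc = dump.wait()
    return dump_rc == 0 and gz_rc == 0
```

Checking the mysqldump status separately matters here because a truncated dump would otherwise still produce a valid-looking .gz file.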
November 12
23:38 brion: XHTML fixes for Collection made the broken 'Random book' link on en.wikibooks.org work again (it very inefficiently loads a giant page of links via JS, and needs it to be clean XML to parse it)
23:16 brion: updated mw-serve
22:48 brion: scapping for Collection ext updates
20:10 brion: updated wgNoticeProject to wikimedia for incubator
18:46 brion: added "uploader" group so we can bump known-good people into being able to upload without waiting for the autoconfirm heuristic
03:14 river: didn't reboot ms1 as its lom is unreachable
01:20 Tim: an error in the cron job on hume caused the r43398 bug to persist until this time, delivering incorrect language text in some site notices.
01:08 Tim: Fixed those 50 servers with a couple of sed commands. Many of them were attempting to send data to larousse and zwinger. Tested srv125.
00:56 Tim: srv125 was spewing PHP fatal errors without reporting them to the syslog on db20. Restarted it. A quick check (ddsh -cM -g apaches -- 'grep -q @syslog /etc/syslog.conf || echo help') suggests that there are 50 apache servers in the same situation.
00:27 Tim: updated ExtensionDistributor configuration to account for amane -> ms1 storage move. (bug 16308)
00:13 Tim: some language issues caused by r43398, reverted at 23:50 and resynced in fixed form at 00:12.
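The 00:56 check greps each apache's /etc/syslog.conf for the string @syslog and flags hosts where it is absent. The same test can be sketched offline against already-collected config text; the function names and the sample host data in the usage note are hypothetical.

```python
def forwards_to_syslog(conf_text, marker="@syslog"):
    """Mirror `grep -q @syslog`: True if any line contains the marker."""
    return any(marker in line for line in conf_text.splitlines())


def hosts_needing_fix(configs, marker="@syslog"):
    """Return the sorted host names whose config lacks the marker.

    configs: mapping of host name -> contents of its syslog.conf.
    """
    return sorted(
        host for host, text in configs.items()
        if not forwards_to_syslog(text, marker)
    )
```

For example, `hosts_needing_fix({"srv125": "*.* @larousse\n", "srv126": "*.* @syslog\n"})` returns `["srv125"]`.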
November 11
23:47 Tim: restored FlaggedRevs stats job as per Batch jobs, removal was not documented.
23:35 Tim: r43398 worked just fine, memory usage dropped from ~4GB to 90MB. Adding rebuildTemplates.php to my crontab on hume, removing it permanently from Brion's on zwinger.
23:28 Tim: updated CentralNotice templates on hume (which has enough memory to do it, unlike zwinger)
22:11 Tim: deleted some binlogs on db1. Remaining disk space is still only 48 GB with negligible InnoDB free space.
16:20 RobH: search2 still down, drives will not detect reliably. Ticket with sun reopened.
15:56 RobH: replaced backplane on search2, reinstalling.
15:13 RobH: srv137 back online. apache and memcached back up.
14:49 RobH: srv100 back online.
10:44 river: removed centralnotice php from brion's crontab as it was breaking zwinger
Core dump suggests the memory usage may be dominated by the localisation cache. wfMsgExt() loads the localisation for the requested language, and all languages are requested. -- Tim 12:07, 11 November 2008 (UTC)
01:19 brion: swapped Commons to use $wgNoticeProject 'wikimedia' rather than having separate 'Commons needs you' notices
00:57 brion: swapped in fundraiser to all projects
November 10
19:18 mark: Shutdown AMS-IX route server 1 session as it's been flapping for hours
November 9
16:11 river: removed nfsfind cronjob on ms1
November 7
22:52 brion_: tossing 2008_meter_2b notice into partial rotation on enwiki -- has reduced collapsed version
22:49 brion_: adding "_collapsed" to banner source tracking for collapsed view
22:27 brion: scapping updates to ContributionReporting and CentralNotice
01:43 Tim: experimentally reading the civicrm database into db10 with --master-data=1
01:19 brion: db9 temporarily (hopefully) messed up. tim's fiddling with it to put it back
01:05 Tim: my.cnf on db10 had an error in it, replicate-wild-do-tables instead of replicate-wild-do-table. Fixed it. The OTRS snapshot is now hopelessly out of date anyway, so I might wipe the data directory and start again. The idea is to set it up to replicate civicrm first. It's 100% InnoDB so should be easy to copy.
00:09 river: upgraded ms2 to solaris 10 update 6
November 6
21:03 Tim: switched GIFs to use Bitmap_ClientOnly (client-side scaling)
17:23 brion: restarting apache on srv47, seems mysteriously stuck
17:15 brion: setting $wgMaxAnimatedGifArea to 1 to prevent animated thumbnailing of GIFs for now, see if that helps
17:10 brion: river complaining of image scaler issues -- load spikes, depooling?
02:35 mark: disabled BGP, now using lvs2 only
02:25 mark: restarting lvs2 with new kernel
01:52 due to switch issues, load balancing to lvs2/lvs4 stopped working. Mark restarted the BGP session which fixed it temporarily.
01:42 Tim: restarting squids
01:42 mark: Setup lvs4 as temp LVS support for lvs2, balancing the load
01:07 brion: updated ContributionReporting to add paging links to ContributionHistory (might be a little funky w/ caching, we'll work it out :)
00:45 Tim: progressively clearing /a on the remaining image scalers
00:37 Tim: wiping /a on srv44
~00:30 lvs2 went into overload and started losing packets. Upload squid slowly went down over the next half hour.
00:00 brion: scapping for update to ContributionReporting
November 5
23:38 brion: set yongle to restart apache every hour since it still seems to bork up and get stuck sometimes
22:01 RobH: srv100 rebooted, was down.
18:28 mark: tech team is procrastinating
18:16 atglenn: added dhelps to office@wikimedia.org alias, redirected office@wikipedia.org to him also
18:14 brion: disabling centralnotice on private wikis, we don't need to be told to donate to ourselves ;)
18:03 brion: poking sitenotices off wikibooks, on *.wikipedia
18:03 brion: set up ariel on mchenry for mail admin
05:38 brion_: opera users may rejoice ;)
05:38 brion_: tweaked storage1 lighttpd config so centralnotice.js is served with utf-8 charset
05:17 brion_: for reference -- load spikes are page rendering on enwiki and dewiki mostly :)
05:16 brion_: bumping enwiki notice to 100%
05:06 Tim: killed various mysqld_safe processes which were using 100% CPU on ES servers
04:50 brion_: fixed morebots -- bots now allowed to edit again at wikitech
04:50 brion_: enabling enwiki notice at about 10% sampling
03:27 brion_: squids are... i think.... looking better :D
... brion: cleaned up movepage attack, restricted editing here for convenience
02:47 brion_: seems happier after restart of front-end squids
02:43 brion_: tim's doing hard restarts of more squids, we're kinda offline briefly
02:34 brion_: disabling centralnotices on remaining sites just for good measure while we debug
02:29 brion_: current status: the squids which borked are still kind of borked, but perhaps slightly better. mark is examining squid memory reports
02:14 brion: tim's attempting to restart borked squids
02:01 brion: disabling enwiki centralnotice while investigating hits dropoff
November 4
21:36 Tim: added nagios monitoring of HTTP on image backends
21:14 Tim: installed NRPE stuff on db19
19:37 Tim: killed the broken NFS mount on db21:/mnt with umount -l. The processes that are waiting for it will probably hang until system restart
18:33 brion_: enabling ja-wikipedia notice for testing :D
18:32 Tim: installed nagios stuff on db21,db22,db23
18:27 Tim: srv104 done, cluster18 re-added to the write list
18:15 Tim: installed NRPE on srv159,srv171,srv183
17:25 domas: bounced db16 after jfs deadlock
17:24 brion: settin' centralnotice on wikibooks to test, should show up in a few minutes
16:00 Tim: fixing max_rows on srv104
15:41 Tim: switching cluster18 master from srv104 to srv105
01:33 Tim: fixing max_rows on srv105 and srv106
01:28 Tim: removed cluster17 from the write list, is full.
November 3
23:28 Tim: installed xdiff and gmp on hume. Used a source install of libxdiff since it's not packaged, and pecl install for the pecl module. Used the stock libgmp, a source install from the debian sources for the PHP GMP module.
22:05 brion: enabled extra file upload types for foundationwiki, since it's restricted-write-access
21:42 Tim: initialising srv159/171/183 as cluster20.
21:24 Tim: srv159 needs to be an ext store, and so will be moved from the disk-intensive image scaler role back to an ordinary apache.
20:46 brion: Special:ContributionTracking form submission intermediary live on foundationwiki
20:33 brion: scapping for ContributionTracking extension
19:59 brion: enabled mp3 and aiff uploads for private wikis so jay can upload some radio PSAs for fundraiser
19:46 brion: poking $wgSquidMaxage from 31 days to 1 hour on wikimediafoundation.org, since templates and funkypage URLs may do funky things and not get purged (extra parameters)
19:32 brion: note there's no notice up yet ;)
19:31 brion: enabling centralnotice loader on all wikis
11:00 domas: mount -o remount,nobarrier /a on db15, observed 20x more performance. I am an idiot. :)
02:36 brion-away: got a test centralnotice notice running on test.wikipedia.org. rock on
02:18 brion: set up every-10-minute cronjob on zwinger to regen the centralnotice template JS files
02:10 brion: centralnotice .js file loader up on test and meta for poking at
01:12 mark: level 3 blackholing of traffic disappeared, brought BGP sessions back up
00:59 mark: shutdown BGP session to AS 30217, for blackholing of traffic behind it (L3?)
00:58 brion: network problems at pmtpa
00:44 brion: for fun, did some load-time optimization on wikitech. trimmed out unneeded user/site .js, consolidated several .js files, and enabled mod_deflate for .css/.js. ssl setup time still sucks, and it's still a 1.7GHz Celeron. :)
November 2
23:43 brion: added bot flag to domas's log bot so it doesn't get hit by the URL captcha
23:29 domas: db19 jfs deadlocked:
23:28 brion: scapping for CentralNotice tweak update
23:11 brion: setting up ContactFormFundraiser on wikimediafoundation.org for fundraiser templates
22:52 brion: scapping for ContactPageFundraiser setup
22:41 brion: poked spamregex update
22:14 brion: added 403 block in checkers.php for 'speichern' GET parameter -- bug in a common dewiki user script allowing CSRF-type vandalism
17:13 Tim: Unmounted /tmp, cleaned up /tmp. Deleting /a/tmp on all image scalers.
16:48 Tim: set ImageMagick temporary directory to /a/magick-tmp. Will unbind the /tmp -> /a/tmp mount.
15:06 river: added missing /mnt/upload5 mount on several apaches: srv37 srv61 srv76 srv69 srv63 srv118 srv132 srv135 srv133 srv138 srv136
14:49 domas: few missing .frm files on db18 were causing trouble, resynced them from db19, resumed replication
13:02 river: copying en from storage1 to ms1
10:49 domas: replaced XFS with JFS on db18, installed ganglia on db17-db30
10:36 river: completed move of commons, now being served from ms1 (except archive/)
November 1
22:48 brion: fixed ContributionReporting to force a utf8 connection, now loads names in right charset
22:20 brion: fixed $wgNoticeInfrastructure setting; defaults must have changed at some point
22:15 domas: installed wikimedia-mysql4 on db21-23, established s1,s2,s3 replication. we now have full database copy in sdtpa \o/
20:53 brion: deploying CentralNotice editing system on meta, woo
20:27 brion: scapping to update reporting and centralnotice bits internally
19:38 brion: rescapping to make sure 159 is unbroken
19:27 brion: svn up'ing on wikitech just for domas
19:25 brion: srv159 is out of space
We need to clean out the damn temp files somehow, eh?
19:20 brion: scapping to update ContributionReporting ext
12:56 mark: uppreffed traffic from knams to pmtpa via 6908/2828, as existing peering path had slight packet loss
11:25 Tim: enabled subpages in the main namespace by default for all Wikisource wikis. This appears to be a de facto standard and is used by all wikisources with an entry in wgNamespacesWithSubpages.
07:55 Tim: disabled ParserDiffTest, obsolete
07:06 mark: XO circuit back up:
[vl101-ve5.csw5-pmtpa.wikimedia.org] BGP peer 2610:18:10a::1 <2610:18:10a::1>, session is now up
[vl101-ve5.csw5-pmtpa.wikimedia.org] BGP peer 207.88.246.5 <207.88.246.5>, session is now up
October 31
23:11 brion: set up some logs for fundraising banner campaign clicks for later mining
17:44 brion: adding support for Tomas skin on wikimediafoundation.org for new fundraiser templates
14:24 mark: XO circuit went down:
[vl101-ve5.csw5-pmtpa.wikimedia.org] BGP peer 207.88.246.5 <207.88.246.5>, session is now down because
[vl101-ve5.csw5-pmtpa.wikimedia.org] BGP peer 2610:18:10a::1 <2610:18:10a::1>, session is now down because
October 30
23:11 Tim: fixed disk space on srv159, db1, srv103
19:03 brion: updated triggers for donation reporting database a few minutes ago
18:14 RobH: moved ms1 from pmtpa:a4 to sdtpa:a1, its back online.
17:46 RobH: db26 OS installed and online
17:28 brion: added a spam filter rule for private-l messages :)
04:54 river: testing sun web server on ms1
03:56 brion: updating squid conf to send upload /centralnotice to storage1 for testing
03:53 brion: tweaked lighttpd config on storage1 for centralnotice static file testing, since amane's configuration is too crappy to support regexes needed to set headers on a directory
02:59 brion: poking experimental expires options on amane for static centralnotice tests
02:44 brion: brion broke lighttpd.conf briefly
October 29
22:39 brion: enabling $wgCodeReviewENotif experimentally
18:35 brion: disabled bitmap fonts in fontconfig on image scalers, seems to help with the "mad helvetica" problem
18:02 RobH: db28 & db29 OS installed and online.
17:59 brion: fixed some upload directory perms on foundationwiki
17:12 RobH: db27 OS installed and online.
16:54 RobH: db21 OS installed and online.
16:38 RobH: db22, db23, db25, db30 were installed yesterday, forgot to admin log it, sorry ;/
14:44 _mary_kate_: copying wikipedia/commons/thumb/4 from storage1 to ms1
October 28
20:02 domas: re-enabled db16
18:03 mark: Removed blackholes.securitysage.com from lily's spamassassin configuration
17:52 domas: db16 fubar'ed by queries that built 100GB temporary tables, leading to jfs hangs, leading to unhappy kernel.
15:23 RobH: updated dsh node group ALL, added backup of frontend data for bugzilla and blogs from isidore to tridge.
12:33 rainman-sr: experimentally turning on "did you mean.." on search8,9 for enwiki
10:44 mark: Reverted yesterday's search changes
October 27
23:24 mark: Switched to lucenesearch 2.1 for all wikis
23:06 mark: pooled search8 as the only search server in search pool 3
22:25 mark: rainman-sr is making me do more ugly things to lucene.php
22:22 mark: Pointed search for "all other wikis" hardcoded to search7 in lucene.php
22:14 mark: Added zhwiki and plwiki to lucene search 2.1 pool 2
October 26
15:43 mark: Set up OpenGear serial console server scs-a1-sdtpa
13:37 mark: Set up iBGP between csw1-sdtpa and csw5-pmtpa (IPv4/IPv6)
13:36 mark: Prepared csw1-sdtpa for production deployment (general configuration)
09:56 domas: updated db18 firmware to 2.1.1 (September 2008)
04:31 Tim: fixed the "service_ips" hostgroup in nagios
03:03 Tim: hardware reboot of db18
02:47 Tim: mysqld on db18 apparently hit a kernel bug. It was reported as a zombie but was still using 200% CPU in top. kswapd was simultaneously using 100% CPU. Did not respond to SIGKILL. The non-zombie parent, mysqld_safe, also did not respond to SIGKILL (wchan=flush_cpu_workqueue). Attempted a reboot with shutdown -r.
02:47 brion: tweaked MaxClientsPerChild on yongle to see if that helps with the mysterious hangs i sometimes see where requests seem to get backed up; it's disrupting the CodeReview proxy as well as mobile & Mac Dictionary search
October 25
20:46 brion: scapped to r42573
08:17 Tim: svn up to 42536 for API overload fix. Re-enabling disabled query modules.
05:55 Tim: svn up/scap to 42531 (for properly tested Interwiki.php fix).
05:09 Tim: DB overload on many enwiki slave servers. Long running queries attributed to ApiQueryAllpages, ApiQueryBacklinks, ApiQueryCategoryMembers and ApiQueryLogEvents. Disabled those modules and killed related running threads.
05:01 Tim: Interwiki links were broken due to a totally broken and untested getInterwikiCached() function. Live patch deployed at this time.
04:33 Tim: Fixed svn conflicts in two files. Scap to r42524.
04:20 Tim: disabled Drafts extension on test.wikipedia.org. Trevor, please contact me for code review.
04:11 Tim: synced php-1.5 to srv35 and ran "make -B" in the serialized directory. Seems to have fixed test. Will scap.
01:01 ariel: preemptively up mail quota to 7GB from 1GB for cbass, dmenard
00:59 brion: testwiki is borked until we figure out how to get it to load updated message files. tried disabling $wgLocalMessageCache and $wgCheckSerialized to no effect
00:51 brion: temporarily blocking scap during testing :) ... running serialized language file updates for test, broken by need to get magic word updates
00:44 brion: preparing a svn up...
00:37 ariel: up msecoquian's mail quota from 1GB to 6.9GB
October 24
23:12 brion: set up ariel (the person) on sanger to do mail administration -- quota fixes etc
16:24 TimStarling: reloaded ourusers.sql on all core and ext. mysql servers, adding a nagios user
15:39 mark: slacking
15:36 TimStarling: added special nagios user to ES instances on clematis
14:00 domas: re-enabled db5, added db18 to s3
10:45 domas: taking out db5 for copy to db18
10:44 domas: fixed ntpd on bart, was pointing to multicast address that doesn't work
09:57 Tim: removed decommissioned servers from monitoring: dryas, alrazi, diderot, friedrich, samuel
07:50 Tim: added monitoring for toolserver ES clusters 17-19
07:40 Tim: regenerated trusted XFF list with extra SAIX proxies
05:00 Tim: fixed nagios check script handling of MySQL connection errors
01:37 brion: setting $wgLicenseURL for Collection to point at GFDL English text
01:01 brion: enabling Drafts on testwiki, but it seems to not be saving there... works on my local test, not sure what the issue is
01:03 brion: disabling logentry, still borken?
October 23
22:33 brion: trying re-enabling logentry ext on wikitech, now with cache disable to avoid edittoken for now
21:34 brion: updating ipblocks table definition
21:25 brion: re-ran svnImport to update path listings for CodeReview
20:11 mark: Set up search7 search9
17:05 mark: Pooled search4 as a s1 search server to help with dead search2
16:33 brion: updated mw-serve
15:38 Tim: On the image scalers, temporarily mounted /a/tmp as /tmp with --bind to stop the disk full problem while we figure out some better solution
15:24 Tim: removed temporary files on image scalers again
14:54 RobH: Replaced dead disk in amane, rebuilding array.
11:04 Tim: Added disk space monitoring for image scalers. Also added apache monitoring which was also missing.
10:53 Tim: freed up disk space on image scalers, magick-* temporary files were filling their root partitions
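The image scaler entries above describe magick-* temporary files filling a partition and disk space monitoring being added afterwards. A minimal sketch of that kind of check, assuming Python is available on the host; the function names, the default pattern argument, and any directory passed in are hypothetical and are not the actual monitoring scripts used here:

```python
import glob
import os
import shutil


def free_fraction(path: str) -> float:
    """Return the fraction of the filesystem holding `path` that is still free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def temp_files(directory: str, pattern: str = "magick-*"):
    """Return (path, size_in_bytes) pairs for matching regular files, largest first.

    The "magick-*" pattern comes from the log entry; the directory to scan
    is an assumption supplied by the caller.
    """
    matches = []
    for path in glob.glob(os.path.join(directory, pattern)):
        if os.path.isfile(path):
            matches.append((path, os.path.getsize(path)))
    return sorted(matches, key=lambda item: item[1], reverse=True)
```

For example, `temp_files("/tmp")` would list leftover magick-* files by size, and `free_fraction("/")` could be compared against an alert threshold of your choosing.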
10:50 Tim: re-added cluster19 to the default write list. Not sure who took it out or why.
10:32 Tim: freed up some space on srv103 (was down to 500MB)
10:29 Tim: fixed monitoring for MegaRAID SAS
07:10 Tim: Set up monitoring of RAID status for all Ubuntu DB servers using the wikimedia-raid-utils package that I just wrote. It doesn't do anything on the MegaRAID servers yet, but the Adaptec ones should work.
05:05 Tim: running CodeReview svnImport.php
October 22
18:26 brion: enabling ODT output for collection
18:17 brion: updating collection and codereview extensions
18:13 Brion: updated mw-serve code and configured to send error emails per jojo's request
17:15 Brion: Changed bugzilla's mail delivery from local sendmail (SSMTP) to direct SMTP, per Mark's recommendation
October 21
19:29 RobH: Bayes upgraded from 2GB to 10GB.
13:49 Tim: Did a demonstration hack of nagios from CSRF to arbitrary shell. Disabled cmd.cgi.
04:13 Tim: Brought srv43-47 up as image scalers with mem limit 6 x 200MB = 1200MB (2GB physical)
October 20
18:11 RobH: srv118 rebooted, back online.
17:25 RobH: srv79 was in kernel panic, rebooted.
05:10 Tim: increased concurrency on srv159 to 15, for mem limit 15 x 200MB = 3000MB
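The memory limits quoted for the image scalers are simply concurrency times a 200MB per-process limit. A small sketch of that arithmetic, assuming 2GB physical means 2048MB; the function names are hypothetical:

```python
PER_PROCESS_MB = 200  # per-process memory limit quoted in the log


def total_limit_mb(concurrency: int, per_process_mb: int = PER_PROCESS_MB) -> int:
    """Worst-case memory use if every concurrent process hits its limit."""
    return concurrency * per_process_mb


def fits(concurrency: int, physical_mb: int, per_process_mb: int = PER_PROCESS_MB) -> bool:
    """True if the worst case stays at or below physical memory."""
    return total_limit_mb(concurrency, per_process_mb) <= physical_mb


print(total_limit_mb(15))  # srv159: 15 x 200MB
print(total_limit_mb(6))   # srv43-47: 6 x 200MB
print(fits(6, 2048))       # srv43-47 against 2GB physical (assumed 2048MB)
```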
02:40 Tim: installed NRPE on khaldun and db20
02:20 Tim: moved disk space checks on the ext stores from the "apaches" service group to the relevant ext store service group
01:53 Tim: installed NRPE on the new ext stores
01:45 Tim: Updated /etc/ssh/ssh_known_hosts on bart (copied from zwinger).
00:30-01:30 Tim: Listed down servers on DC tasks. Removed broken servers from memcached rotation. Restarted apache on srv99, srv109, srv123. Purged master binlogs on srv102.
October 18
21:45 RobH's mighty index finger brought amane and the site back up.
21:00 river: Ran 'nc -l -p 623' command, amane's kernel panic'ed. Rob was called.
20:55 mark, river: diagnosed the NFS communication problems to be caused by NIC hardware packet interception of port 623 packets... amane wasn't receiving NFS replies from ms1.
19:40 mark: Upload got unhappy, ms1 NFS mount on amane was unreachable and stalling things
13:40 Tim: down again, single process allocating all memory
07:35 Tim: took it down again, while recording /proc/vmstat and /proc/stat
06:27 Tim: restarted srv160
05:45 Tim: took srv160 into the purple for a much more convincing overload, and different oprofile results
03:40 Tim: used oprofile to determine what part of the kernel is responsible for the system CPU spike. Looks like a spinlock in dnotify.
03:12 Tim: simulated a memory-intensive request rate spike to srv160. Large system CPU response spike, but it didn't go down completely. Will try a bigger one.
October 17
21:10 brion: enabled Commons foreign image repo on Wikitech
18:45 brion: created Wikimedia-Boston list for SJ
16:55 brion: adding nomcomwiki to special.dblist so it shows up right in sitematrix
16:45 brion: deleted some junk comments from bugzilla
16:31 brion: changed autoconfirm settings for 'fishbowl' wikis -- 0 age for autoconfirm, plus set upload & move for all users just in case autoconfirm doesn't kick in right
14:22 RobH: srv131 back up.
09:03 Tim: copying srv129 and srv139 ES data directories to storage2:/export/backup
02:49 Tim: excessive lag on db16, killed long-running queries and temporarily depooled. CUPS odyssey continues.
01:59 Tim: removing cups on all servers where it is running
00:00 RobH: restarted srv43-47
October 16
20:42 brion: added 3 more dump threads on srv31... we need to find some more batch servers to work with for the time being until new dump system is in place :)
20:20 RobH: pulled samuel from the rack, decommissioned, RIP samuel.
19:35 RobH: migrated rack B4 from asw3 to asw-b4-pmtpa.
18:40 RobH: rebooted scs-ext oops!
18:26 RobH: srv61 reinstalled and redeployed.
18:24 RobH: Adler re-racked with rails, booted up to maintenance mode prompt.
17:34 mark: 208.80.152.0/25 NTP restriction is actually also not broad enough - changed it to /22 in ntpd.conf on zwinger
17:02 brion: thumbnails on commons are insanely slow and/or broken
14:44 Tim: added a more comprehensive redirection list to squid.conf.php for storage1 images
14:04 Tim: redirected images for /wikipedia/en/ to storage1, apparently they were moved a while ago. Refactored the relevant squid.conf section.
13:38 Tim: disabled directory index on amane. Was generating massive amounts of NFS traffic by generating a directory index for some timeline directories.
12:51 Tim: increased memory limit on srv159 to 8x200MB. Still well under physical.
11:38 Tim: cleaned up temporary files on srv159, had filled its disk
11:25 Tim: synced upload scripts (including to ms1)
10:06 Tim: removed sq50 from the squid node lists and uninstalled squid on it
09:22 - 09:52 mark, Tim, JeLuF: initial attempts to bring the squids back up failed due to incorrect permissions on the recreated swap logs. Most were back up by around 09:32, except newer knams and yaseo squids which were missing from the squids_global node group. The node group was updated and the remainder of the squids brought up around 09:52.
09:19 JeLuF: deployed squid.conf with an error in it. All squid instances exited.
08:26 Tim: Restarted ntpd on search7, was broken
06:42 Tim: ntp.conf on zwinger had the wrong netmask for the 208.x net, it was /26 instead of /25. So a lot of squids were out of it, and some had a clock skew of 10 minutes (as visible on ganglia). Fixed ntp.conf, not stepped yet. Will affect squid logs.
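The NTP entries for this day turn on how many hosts each netmask on 208.80.152.0 admits: /26, then /25, then /22. A minimal sketch using Python's standard ipaddress module to show the difference; the sample host addresses are hypothetical and were chosen only to land in different ranges:

```python
import ipaddress

BASE = "208.80.152.0"    # network address named in the log
PREFIXES = (26, 25, 22)  # the three prefix lengths mentioned in the log


def covered(host: str, prefix: int) -> bool:
    """Return True if `host` falls inside BASE/prefix."""
    net = ipaddress.ip_network(f"{BASE}/{prefix}")
    return ipaddress.ip_address(host) in net


# Hypothetical hosts, not taken from the log.
SAMPLE_HOSTS = ["208.80.152.10", "208.80.152.100", "208.80.153.20"]

for host in SAMPLE_HOSTS:
    print(host, {prefix: covered(host, prefix) for prefix in PREFIXES})
```

A /26 admits only .0 to .63 of 208.80.152.x, a /25 admits .0 to .127, and a /22 admits 208.80.152.0 through 208.80.155.255, so a host outside the narrower range would be refused by a restriction written with the wrong mask.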
October 15
19:49 brion: added 'Category
Server Admin Log archive