Quick status update
The web frontend uses both PHP and Python, so it can't run on any of the standard k8s images.
The backend services are all written in Python, but use a lot of app-internal libraries, upstream non-default Python libraries (like pywikibot), and third-party binaries (like tesseract), which means they can't just be fired up on any of the standard k8s images either.
All of which means we are going to have to get into the Toolforge Build Service to build a custom image that can run both the web frontend and the backend services. Which in turn means we are going to have to create a GitLab repo for it, since the Build Service has to have a Git repo to work from. Because we have mixed runtimes (PHP + Python), buildpack autodetection is not going to work. And because we have a lot of libraries and binaries as deps, we're going to have to figure out all the nitty-gritty details of how to customize Build Service builds, especially the parts that are even more poorly documented than the already poor baseline. The buildpacks (and the Build Service) also assume modern coding conventions (e.g. using Python venvs to manage deps), so in the process we are going to have to modernize the code at least somewhat.
All of which points to this being a long and tedious slog of a job, starting with importing all our old grotty code into Git, reverse engineering all our deps for a venv, and then iterating on the Build Service images to get them working. Oh, and to make this as annoying as possible, we can't work on these images locally because we need deps from apt, so we're going to have to do all of this on Toolforge, in production, in a tightly coupled codebase.
The GitLab repo is going to have to be tightly managed, unlike the phetools account on Toolforge, where first Phe and then subsequent admins have used it more like an ad hoc shell account than what is required for a k8s service (lots and lots of random temporary files in the home directory, lots of test and debug versions of code in the code directories, mixing code and other stuff, etc.). If we bring the current mess with us we'll never get the Build Service to work reliably, and figuring out the migration is going to take ten times as long.
Some architecture notes
The web frontend expects to run on a "fat" image under lighttpd. It consists of some relatively light PHP files that mostly provide the interactive web interface on phetools. Most of the content served by PHP is pre-generated static assets from NFS, with some minor CGI-ish stuff. The exception is the job status pages, where the PHP communicates with running jobs over a socket using a custom text-based protocol (ping, status, etc.). The PHP is still very minimal, so migrating it to a "thin" image shouldn't be a big problem. The Python parts are what provide the API for things like the OCR and Match&Split Gadgets on-wiki. These parts use the same libraries as the actual jobs themselves, and so they basically need everything the jobs need.
Communication between the frontend and backend jobs is established by each job writing its actual hostname and a port number it is listening on to a text file on shared NFS storage. The frontend reads this file, opens a socket to that host:port, sends a text command, waits for the answer, and sends it to the web browser. All the backend jobs use a common but app-custom library to implement this.
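For illustration, here is a minimal sketch of both ends of that handshake in Python. The file path, the command names, and the newline-terminated reply framing are assumptions; the actual wire format lives in the app's common library.

    import socket
    from pathlib import Path

    # Hypothetical "<job>.server" file each backend job writes on startup.
    PORT_FILE = Path("/data/project/phetools/tmp/ws_ocr_daemon.server")

    def announce(listener: socket.socket) -> None:
        """Backend side: publish the host:port we are actually listening on."""
        _, port = listener.getsockname()
        PORT_FILE.write_text(f"{socket.gethostname()}:{port}\n")

    def query(command: str = "ping") -> str:
        """Frontend side: look up the backend and send one text command."""
        host, port = PORT_FILE.read_text().strip().rsplit(":", 1)
        with socket.create_connection((host, int(port)), timeout=10) as sock:
            sock.sendall(command.encode("ascii") + b"\n")
            return sock.makefile().readline().strip()  # assumes one-line replies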
Phetools consists of the webservice described above, a set of scheduled grid jobs, and a set of continuous grid jobs.
The webservice
The webservice lives in public_html and currently consists of the following:
public_html/
May 20 2014 commonPrint.css -> ../phe/common/css/commonPrint.css
Oct 28 2021 data/
Oct 27 2021 dbtest.php -> /data/project/phetools/phe/statistics/dbtest.php
May 30 2014 dummy_robot.php -> ../phe/dummy_robot/dummy_robot.php
May 30 2014 extract_text_layer.php -> ../phe/extract_text_layer/extract_text_layer.php
Nov 23 2021 graphs/
Aug 5 2014 hocr.php -> ../phe/hocr/hocr.php
May 15 2014 index.html -> ../phe/public_html/index.html
May 4 2020 jquery-3.5.1.min.js
Oct 6 2020 jquery-3.5.1.min.js.save
May 17 2014 log-irc -> ../log/log-irc/
May 30 2014 match_and_split.php -> ../phe/match_and_split/match_and_split.php
Apr 27 2016 not_transcluded -> ../tmp/transclusions/
Jun 5 2014 ocr.php -> ../phe/ocr/ocr.php
Nov 3 2021 proofread.html -> ../phe/statistics/proofread.html
Nov 3 2021 proofread_per_day.html -> ../phe/statistics/proofread_per_day.html
May 20 2014 screen.css -> ../phe/common/css/screen.css
May 20 2014 shared.css -> ../phe/common/css/shared.css
Jan 29 2020 sorttable.css -> ../phe/statistics/sorttable.css
Jan 29 2020 sorttable.js -> ../phe/statistics/sorttable.js
Oct 13 2020 statistics.js
May 20 2014 statistics.php -> ../phe/statistics/statistics.php
Oct 27 2021 statistics2.php -> /data/project/phetools/phe/statistics/statistics2.php
May 20 2014 stats.html -> ../phe/statistics/stats.html
Oct 27 2021 stats_table.css -> /data/project/phetools/phe/statistics/stats_table.css
May 26 2014 status.php -> ../phe/public_html/status.php
May 20 2014 transclusions.html -> ../phe/statistics/transclusions.html
Nov 3 2021 validated.html -> ../phe/statistics/validated.html
Nov 3 2021 validated_per_day.html -> ../phe/statistics/validated_per_day.html
May 30 2014 verify_match.php -> ../phe/verify_match/verify_match.php
Not all of these files are used; some are obsolete and stopped working long ago, and some are tests. The graphs subdir contains static images pregenerated by a scheduled job. The JS parts are pretty simple and don't really need a local jQuery (and since the Wikimedia privacy policy doesn't apply to Toolforge anyway, we might as well use upstream jQuery from cdnjs).
Scheduled jobs
The current scheduled jobs are:
crontab -l
### KUBERNETES MIGRATION IN PROGRESS ###
### Please do not add any more Grid Engine jobs ###
#
47 4 * * * jsub -N wikisource_stats -l h_vmem=1024M -o ~/log/wikisource_stats.out -e ~/log/wikisource_stats.err ~/phe/statistics/gen_stats.sh
28 4 * * * jsub -N phe_logrotate -o ~/log/logrotate.out -e ~/log/logrotate.err /usr/sbin/logrotate ~/phe/logrotate.conf -s ~/log/logrotate.status
# Do not reenable hocr_request. It will try to download and run OCR on every
# single DjVu and PDF file on every single wikisource + commons. Xover 7. nov. 2021
#47 * * * * jsub -N hocr_request -o ~/log/hocr_request.out -e ~/log/hocr_request.err -v PYTHONPATH=$HOME/phe python3 ~/phe/hocr/hocr_request.py -prepare_request
*/10 * * * * jlocal python3 ~/phe/jobs/sge_jobs.py >> ~/log/sge_jobs.out 2>> ~/log/sge_jobs.err
# broken with stretch
# 27 4 * * * jsub -N wsircdaemon -once -o ~/log/cron_irc.err -e ~/log/cron_irc.err -quiet python -u phe/ircbot/pyirclogs.py
# Nice try, but it can't work as run_service.sh use jsub which is not available
# in exec node
# 13 4 * * * jsub -N restart_ws_ocr -once -o ~/log/cron_ws_ocr.out -e ~/log/cron_ws_ocr.err ~/phe/run_service.sh restart ws_ocr_daemon
17 5 * * * jsub -N not_transcluded -o ~/log/not_transcluded.out -e ~/log/not_transcluded.err python3 ~/phe/statistics/not_transcluded.py
33 5 * * * jsub -N ppdir -o ~/log/dbtest.out -e ~/log/dbtest.err python3 ~/phe/statistics/dbtest.py
45 5 * * * jsub -N restart_match_split -o ~/log/cron_match_split.out -e ~/log/cron_match_split.err ~/phe/run_service.sh restart match_and_split
The IRC stuff hasn't worked in ages and I see no need to try to revive it, so that's going the way of the dodo.
restart_match_split is voodoo that "fixes" a problem where the job in question frequently hangs; it should probably just be dropped so we can fix the underlying problem properly.
logrotate is just calling the system logrotate with our config to rotate our own log files. We really should have a better facility for log rotation built into Toolforge, but at least this one should be fairly straightforward to migrate to a k8s scheduled job.
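If we end up owning log rotation ourselves anyway, one alternative (a sketch of an option, not a decision) is to drop logrotate entirely and have the Python jobs rotate their own log files with the stdlib handler:

    import logging
    from logging.handlers import RotatingFileHandler

    # Sketch: per-job self-rotation instead of a logrotate cron job.
    # Path and limits are illustrative, not current phetools config.
    handler = RotatingFileHandler(
        "/data/project/phetools/log/match_and_split.log",
        maxBytes=10 * 1024 * 1024,  # rotate at 10 MiB
        backupCount=5,              # keep five rotated generations
    )
    logging.basicConfig(level=logging.INFO, handlers=[handler])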
sge_jobs is going to be a headache, because this is where we parse the Grid Engine accounting file to keep track of our own internally-scheduled jobs and update the custom DB that in essence implements a completely custom job queue. I don't think we have access to anything comparable on k8s, and to the degree we do, it would mean talking to the raw Kubernetes API. I don't fully understand this system or what depends on it, but I seem to recall that the last time it had trouble, other stuff started breaking. And the second we touch this we run the risk of adding garbage to that database that isn't going to be easy to untangle without any real documentation.
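For reference, the accounting file is colon-delimited with one finished job per line, so the parsing itself is simple. The sketch below follows the documented accounting(5) field layout, but both the path and the field positions are assumptions to verify against phe/jobs/sge_jobs.py:

    # Sketch of reading the Grid Engine accounting file. The path and the
    # field positions (from the accounting(5) man page) are assumptions.
    ACCOUNTING = "/data/project/.system_sge/gridengine/default/common/accounting"

    def finished_jobs():
        with open(ACCOUNTING, encoding="utf-8", errors="replace") as fh:
            for line in fh:
                if line.startswith("#"):
                    continue
                f = line.rstrip("\n").split(":")
                yield {
                    "job_name":    f[4],
                    "job_number":  int(f[5]),
                    "end_time":    int(f[10]),
                    "failed":      int(f[11]),
                    "exit_status": int(f[12]),
                }

The hard part isn't the parsing; it's that nothing on k8s emits an equivalent record of finished jobs for us to consume.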
wikisource_stats isn't actually written in shell; it just has a thin shell wrapper that sets PYTHONPATH for pywikibot (Py2 vs. Py3) and starts the script with python3. It otherwise uses the same libraries and pywikibot as the rest of the main tools.
How to deal with stdout/stderr logging is going to be an issue for all of these (and our backends) on k8s.
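The k8s-native convention is the opposite of what the crontab does: write to stdout/stderr and let the platform capture it, rather than redirecting to per-job files in ~/log. For the Python jobs that part, at least, is a one-liner; a sketch:

    import logging
    import sys

    # k8s convention: log to stderr and let the platform collect it,
    # instead of the ~/log/*.out and *.err redirects in the crontab.
    logging.basicConfig(
        stream=sys.stderr,
        level=logging.INFO,
        format="%(asctime)s %(name)s %(levelname)s: %(message)s",
    )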
Grid Engine jobs
The real backend for phetools is a set of continuous Grid Engine jobs started from ~phe/run_service.sh. The script gives you start/stop/status/restart options, sets environment variables and paths, etc. The jobs there are currently:
run_service.sh
dummy_robot
ws_ocr_daemon
match_and_split
extract_text_layer
verify_match
I am not sure what dummy_robot is for (it may just be a test job). The rest are the moving parts of the OCR and Match&Split services, and data generated by them is used in various other places. They share a custom common app framework, implement a lot of functionality that's now available upstream in pywikibot (lists of languages etc.), use the custom job queue system, keep massive caches, and write to the wikis using the phe-bot account (and as such should probably really be migrated to OAuth, but since everything is very asynchronous that raises a whole host of other issues).
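On the OAuth point: pywikibot already supports owner-only OAuth consumers through the authenticate mapping in user-config.py, so a minimal sketch of the switch could look like this (the wildcard domain and all four tokens are placeholders):

    # user-config.py fragment: switch phe-bot from password login to an
    # owner-only OAuth consumer. All four tokens are placeholders.
    usernames['wikisource']['*'] = 'Phe-bot'
    authenticate['*.wikisource.org'] = (
        'consumer_key', 'consumer_secret',
        'access_key', 'access_secret',
    )

That covers the mechanics; the asynchronous-writes question (whose credentials a job queued hours earlier should act under) is the part that still needs thinking through.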