Installation Guide - Apache MADlib - Apache Software Foundation
Installation Guide
Created by
Frank McQuillan
, last modified by
Orhan Kislal
on
Sep 02, 2023
To set up PostgreSQL and MADlib with Anaconda Python on OSX, follow the super quick start.
Otherwise, follow the regular guides for installing from binaries or compiling from source.
Developers may want to use the Docker image described in the Developer Guide.
Sometimes there are release-specific variations of the installation procedures. These exceptions are listed at the bottom of this page in the section called Release Specific Installations.
MADlib requires python version 2.7. Currently, Python 3.x is not supported.
Currently supported database versions: please see this page for supported databases and operating systems.
The following python libraries are required for their associated modules:
Deep Learning: dill, grpcio==1.39.0, protobuf==3.17.3, hyperopt==0.2.5, tensorflow == 1.14, scikit-learn==0.19
XGBoost: pandas, xgboost==0.82
KNN: scipy==1.2.1
Unit tests: pgsanity
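As a sketch of how the pinned versions above could be captured for repeatable installs, the snippet below writes them to a requirements file. The file name madlib-requirements.txt and the REQ_FILE variable are arbitrary choices for this example, not MADlib conventions, and the suggested pip invocation assumes pip is available for your Python 2.7 interpreter.

```shell
# Write the library versions listed above into a requirements file.
# REQ_FILE and its default name are assumptions made for this sketch.
REQ_FILE="${REQ_FILE:-madlib-requirements.txt}"
cat > "$REQ_FILE" <<'EOF'
# Deep Learning
dill
grpcio==1.39.0
protobuf==3.17.3
hyperopt==0.2.5
tensorflow==1.14
scikit-learn==0.19
# XGBoost
pandas
xgboost==0.82
# KNN
scipy==1.2.1
# Unit tests
pgsanity
EOF
# Only print the install command; run it yourself once the file looks right.
echo "Install with: python2.7 -m pip install -r $REQ_FILE"
```

Install only the groups you need; each group is required only by its associated module.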
Super Quick Start
To set up PostgreSQL + MADlib with Anaconda Python on OSX:
PYTHON=/Users/janedoe/anaconda/bin/python
Install Postgres with the Python extension specified (i.e., --with-python), as described here in the PostgreSQL documentation. Note that previously you could install Postgres with Python support using brew by running 'brew install postgresql --with-python', but passing the '--with-python' flag is no longer supported.
Set up database and roles
Install the .dmg of the latest MADlib release, downloaded from the MADlib website
/usr/local/madlib/bin/madpack -s madlib -p postgres install
Quick Start With Binaries
Prerequisites
Install and configure your database of choice. MADlib currently supports the following platforms:
PostgreSQL
Greenplum database
MADlib requires the
GNU M4 Unix macro processor
which must be present for installation to succeed.
If
the environment variables listed below are defined
, it can save you some typing.
Postgres platform notes:
Ensure that you install Postgres with the Python extension specified (i.e., --with-python), as described here in the PostgreSQL documentation.
If not, you will see an error message like the one below when you try to install MADlib with madpack:
/usr/local/madlib/bin/madpack -s madlib -p postgres install
madpack.py : INFO : Detected PostgreSQL version 9.5.
madpack.py : INFO : *** Installing MADlib ***
madpack.py : INFO : MADlib tools version = 1.9.1 (//usr/local/madlib/Versions/1.9.1/bin/../madpack/madpack.py)
madpack.py : INFO : MADlib database version = None (host=localhost:5432, db=postgres, schema=madlib)
madpack.py : INFO : Testing PL/Python environment...
madpack.py : INFO : > Creating language PL/Python...
madpack.py : ERROR : SQL command failed:
SQL: CREATE LANGUAGE plpythonu;
ERROR: could not access file "$libdir/plpython2": No such file or directory
madpack.py : ERROR : Cannot create language plpythonu. Please check if you
have configured and installed portid (your platform) with
`--with-python` option. Stopping installation...
madpack.py : ERROR : MADlib installation failed
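Before running madpack, it may help to check whether a PL/Python library is present at all. The sketch below looks for files named plpython* in PostgreSQL's package library directory, as reported by pg_config --pkglibdir. The function name check_plpython is made up for this example, and the check only confirms that a file exists, not that the language can actually be created.

```shell
# Hypothetical helper: look for a PL/Python shared library.
check_plpython() {
    # $1: directory to inspect; defaults to PostgreSQL's package library dir
    local libdir="${1:-$(pg_config --pkglibdir)}"
    if ls "$libdir"/plpython* >/dev/null 2>&1; then
        echo "PL/Python library found in $libdir"
    else
        echo "No plpython library in $libdir" >&2
        return 1
    fi
}
```

Calling check_plpython with no argument inspects the installation that the pg_config on your path belongs to.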
Installing MADlib
Download the MADlib binary
For Postgres: OS X and Linux binaries can be found on the
MADlib download page
For Greenplum: Linux .gppkg binaries can be found on
Pivotal Network
in the "Greenplum Advanced Analytics Group"
NOTE: the above .gppkg binaries work for both open and closed source Greenplum and can be downloaded by anybody (after creating a Pivotal Network account)
Install the package.
Postgres:
on OSX double click the installer package
on Redhat / CentOS run the following as root:
yum install
or
rpm -i
Greenplum:
on Redhat / CentOS run the following as gpadmin:
gppkg -i
NOTE: if you are using an rpm package on a CentOS 5 system, please add
--no-deps
flag to the command.
Ensure that the environment is set up for your database deployment and that the database is up and running.
Ensure that psql, postgres, and pg_config are in your path
which psql postgres pg_config
Ensure that the database is started and running
psql -c 'select version()'
The above may need user/port/password setting depending on how the database has been configured.
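The environment checks above can be wrapped in a small preflight helper. This is only a sketch: require_tools is a name invented here, and including m4 in the usage line reflects the prerequisite stated earlier rather than anything madpack itself verifies.

```shell
# Hypothetical helper: report every tool in the argument list that is not on PATH.
require_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    # Non-zero exit status if at least one tool was not found
    return "$missing"
}
# Example usage:
#   require_tools psql postgres pg_config m4 && psql -c 'select version()'
```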
Run the MADlib deployment utility to deploy MADlib into each database that you want to use it in:
Postgres:
/usr/local/madlib/bin/madpack -s madlib -p postgres install
if
environment variables are defined
. Otherwise use a fully defined connection string:
/usr/local/madlib/bin/madpack -s madlib -p postgres -c [user[/password]@][host][:port][/database] install
Greenplum Database:
/usr/local/madlib/bin/madpack -p greenplum install
The above may need user/port/password setting depending on how the database has been configured.
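The connection string format [user[/password]@][host][:port][/database] has several optional parts, so it can be easier to assemble it from separate values. The helper below is a sketch; madpack_conn is a hypothetical name, and the user, host and database values in the usage line are placeholders.

```shell
# Hypothetical helper: build a madpack connection string from its parts.
# usage: madpack_conn USER PASSWORD HOST PORT DATABASE (any may be empty)
madpack_conn() {
    local user="$1" password="$2" host="$3" port="$4" db="$5" out=""
    if [ -n "$user" ]; then
        out="$user"
        [ -n "$password" ] && out="$out/$password"
        out="$out@"
    fi
    out="$out$host"
    [ -n "$port" ] && out="$out:$port"
    [ -n "$db" ] && out="$out/$db"
    echo "$out"
}
# Example usage (placeholder values):
#   /usr/local/madlib/bin/madpack -s madlib -p postgres \
#       -c "$(madpack_conn gpadmin '' localhost 5432 madlibtest)" install
```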
After installation gpadmin should grant all privileges on schema madlib to users who will be accessing MADlib functions.
Otherwise, users will get "ERROR: permission denied for schema MADlib." Also, install checks (see next step below) will fail if CREATE TEMP TABLE privileges are not granted on the schema where MADlib is installed.
See the PostgreSQL docs for information on schemas and privileges.
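A minimal sketch of the schema grant described above: the helper prints a GRANT statement that can be piped to psql. The function name madlib_grant_sql and the role name analyst are hypothetical, and the helper covers only the schema-level grant, so any temporary-table privilege still has to be handled separately.

```shell
# Hypothetical helper: print the GRANT statement for one role.
# $1: role name, $2: schema name (defaults to madlib)
madlib_grant_sql() {
    printf 'GRANT ALL ON SCHEMA %s TO %s;\n' "${2:-madlib}" "$1"
}
# Example usage (role name is a placeholder):
#   madlib_grant_sql analyst | psql
```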
Test your installation
Postgres:
/usr/local/madlib/bin/madpack -s madlib -p postgres install-check
Greenplum Database:
/usr/local/madlib/bin/madpack -p greenplum install-check
The above may need user/port/password setting depending on how the database has been configured.
Please note that if the optimizer_control GUC is set to off in Greenplum, the following install checks will fail, and these MADlib functions will not work: decision tree, random forest, LDA, k-Means, PMML export for decision tree, PMML export for random forest. This will be fixed in a future release (MADLIB-1109).
The parameter
optimizer_control
controls whether the server configuration parameter optimizer can be changed. The parameter
optimizer
controls whether the GPORCA optimizer is enabled when running SQL queries.
Installing from PGXN (PostgreSQL)
Prerequisites
Requirements for installing MADlib:
gcc and g++ (for OSX, Clang will work for compiling the source, but not for the documentation). Note: C++11 is not fully supported yet.
m4
patch
cmake
pgxn installed
PostgreSQL (64-bit) 9.2+ with plpython support enabled. Note: plpython may not be enabled in Postgres by default.
Use the commands below to install and load the latest MADlib package uploaded on
PGXN
pgxn install madlib
pgxn load madlib
If you see the following error, it's likely that you are using Parallel Execution flags for make.
[ 86%] Performing build step for 'EP_boost'
Ignored: make
[ 86%] Performing install step for 'EP_boost'
Ignored: make
[ 86%] Completed 'EP_boost'
[ 86%] Built target EP_boost
make[1]: *** [all] Error 2
make: *** [all] Error 2
ERROR: command returned 2: make PG_CONFIG=/usr/local/pg10/bin/pg_config all
You can run this as a workaround:
MAKEFLAGS='-j1' pgxn install madlib
pgxn load madlib
Or, if you want to use parallel execution, you can also install Boost 1.60 yourself, and tell cmake where to find it.
For example, on OSX that looks like this:
brew install boost@1.60
export BOOST_INCLUDEDIR=/usr/local/opt/boost@1.60/include/
Compiling From Source
Prerequisites
Requirements for installing MADlib:
gcc and g++
For OS X, Clang will work for compiling the source, but not for the documentation. To compile on newer versions of Xcode, the CXX11 flag must be enabled. Setting -DCXX11=1 during cmake will auto-download Boost 1.75.0 if Boost > 1.65.0 is not found on the system.
Note: Setting -DCXX11=1 will enable C++11, which is not fully supported, i.e., MADlib compiles but some install-check/dev-check tests may fail.
python 2.6 or 2.7
python 3.x is not currently supported by MADlib.
cmake
NOTE: the latest version of cmake might cause issues. Please try
cmake 3.5.2
in case you get an error or a segmentation fault.
NOTE: On CentOS 6 (and possibly other Linux variants), cmake has occasionally failed with a segmentation fault when the greenplum_path.sh file was sourced before running cmake. If you encounter issues, you can use ldd on the cmake executable to check whether dynamic libraries are being picked up from the Greenplum installation directories. If so, start a new shell session in which greenplum_path.sh has not been sourced and run cmake there. See MADLIB-1093 for additional details.
An installed version of Greenplum Database or PostgreSQL (64-bit) 9.2+ with plpython support enabled.
NOTE: plpython may not be enabled in Postgres by default.
Installing MADlib
In the
$MADLIB_ROOT
directory (location of the MADlib source) run the following commands:
mkdir build
cd build
cmake ..
make -j8 # if this causes issues, switch back to a plain `make`
Above, we built the executables in the
build
folder. This can, however, be any user-named folder (henceforth called
$BUILD_ROOT
).
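The build steps above can be collected into one function that falls back to a plain make when the parallel build fails. This is a sketch under assumptions: build_madlib is a name invented for this example, and retrying with plain make after a failed make -j8 is a convenience suggested by the comment in the steps, not a documented MADlib procedure.

```shell
# Hypothetical helper: configure and build MADlib in a named build folder.
# $1: MADlib source root ($MADLIB_ROOT), $2: build folder name (default: build)
build_madlib() {
    local src="$1" build="${2:-build}"
    (
        # The subshell keeps the caller's working directory unchanged.
        mkdir -p "$src/$build" || exit 1
        cd "$src/$build" || exit 1
        cmake .. || exit 1
        # Try a parallel build first; retry serially if it fails.
        make -j8 || make
    )
}
# Example usage:
#   build_madlib "$MADLIB_ROOT" build
```

After a successful run, $BUILD_ROOT corresponds to the folder passed as the second argument, located under the source root.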
Deploying MADlib
Deploy MADlib into the database with MADlib package manager
madpack
located under
$BUILD_ROOT/src/bin
Run the MADlib deployment utility to install MADlib into each database that you want to use it in:
Postgres:
$BUILD_ROOT/src/bin/madpack -s madlib -p postgres install
if
environment variables are defined
. Otherwise use a fully defined connection string:
$BUILD_ROOT/src/bin/madpack -s madlib -p postgres -c [user[/password]@][host][:port][/database] install
Greenplum Database:
$BUILD_ROOT/src/bin/madpack -p greenplum install
The above may need user/port/password setting depending on how the database has been configured.
To install:
$BUILD_ROOT/src/bin/madpack -p postgres -c [user[/password]@][host][:port][/database] install
To make sure that the installation is successful:
$BUILD_ROOT/src/bin/madpack -p postgres -c [user[/password]@][host][:port][/database] install-check
For more information on the usage of
madpack:
$BUILD_ROOT/src/bin/madpack --help
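Because install-check is only meaningful after a successful install, the two commands can be chained. The wrapper below is a sketch; madpack_install_and_check is a hypothetical name, and it uses only the -p and -c options and the install and install-check commands shown above.

```shell
# Hypothetical helper: run install, then install-check only if install succeeded.
# $1: path to madpack, $2: platform (postgres or greenplum),
# $3: optional connection string
madpack_install_and_check() {
    local madpack="$1" platform="$2" conn="$3"
    # Rebuild the positional parameters as the shared madpack options.
    set -- -p "$platform"
    [ -n "$conn" ] && set -- "$@" -c "$conn"
    "$madpack" "$@" install && "$madpack" "$@" install-check
}
# Example usage:
#   madpack_install_and_check "$BUILD_ROOT/src/bin/madpack" postgres
```

When the connection string is omitted, madpack falls back to the environment variables described in the next section.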
Defining environment variables
The variables below will be automatically used by the
madpack
installer if no connection string is provided:
User:
PGUSER
or
USER
(defaults to OS username)
Password:
PGPASSWORD
(defaults to empty)
Host:
PGHOST
(defaults to 'localhost')
Database:
PGDATABASE
(defaults to OS username)
Port:
PGPORT
(defaults to 5432)
An example of deploying MADlib using the environment variables:
export PGPORT=5430
export PGHOST=127.0.0.1
export PGDATABASE=madlibtest
$BUILD_ROOT/src/bin/madpack -p postgres install
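To preview which connection settings would apply before running madpack, the defaults listed above can be mirrored in a few lines of shell. This is an illustration of the documented defaults, not madpack's actual resolution code; madpack_effective_conn is a hypothetical name, the password is deliberately left out of the output, and id -un stands in for "OS username".

```shell
# Hypothetical helper: show the connection settings implied by the environment.
madpack_effective_conn() {
    local user="${PGUSER:-${USER:-$(id -un)}}"   # PGUSER, then USER, then OS username
    local host="${PGHOST:-localhost}"            # defaults to localhost
    local port="${PGPORT:-5432}"                 # defaults to 5432
    local db="${PGDATABASE:-$(id -un)}"          # defaults to OS username
    echo "$user@$host:$port/$db"
}
# Example usage:
#   export PGPORT=5430 PGHOST=127.0.0.1 PGDATABASE=madlibtest
#   madpack_effective_conn
```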
Defining GPDB variables
The variables below can be set in GPDB in case memory-related issues show up. Feel free to adjust them based on the specifics of the installed system.
set max_statement_mem='50GB';
set statement_mem='50GB';
set memory_spill_ratio=80;
set gp_resqueue_memory_policy=auto;
set work_mem='4GB';
set gp_vmem_protect_limit=20000;
Upgrading MADlib gppkg
Download the MADlib binary
Greenplum database : Download the .gppkg binary from
Pivotal Network
Upgrade MADlib gppkg.
Greenplum Database:
Upgrading gppkg to a higher version of MADlib:
For example, upgrading from 1.15.1 to 1.16
on Redhat / CentOS run the following as gpadmin:
gppkg -u
Upgrade the MADlib deployment in the database
madpack -p
Upgrading gppkg for the same version of MADlib:
For example, upgrading from madlib_gppkg_1.16+1 to madlib_gppkg_1.16+2
on Redhat / CentOS run the following as gpadmin:
gppkg -u
MADlib deployment in the database does not need to be upgraded as the MADlib version has not changed.
Release Specific Installations
Sometimes there are release specific variations of the installation procedures. These exceptions are listed in this section.
06/27/19 - Upgrading MADlib from 1.15
Currently, upgrading the rpm from 1.15 using
rpm -U
does not work due to a change in the rpm post uninstall script in MADlib version 1.15.1. Below are the steps to follow to upgrade from MADlib version 1.15:
Remove existing MADlib rpm (this does not affect the database in any way)
rpm -e
Remove old MADlib files
rm -rf /usr/local/madlib/Versions
Install the MADlib 1.15.1 or 1.16 rpm
rpm -i
Upgrade the MADlib deployment in the database
madpack -p
01/11/18 - Upgrading MADlib to 1.13
The upgrade to v1.13 has a minor problem with some leftover functions. The issue can be fixed with the following commands before running the regular madpack upgrade command.
psql <
psql <
<
<
We have also attached a script to this wiki page called 'fix_upgrade.sh' that you can use.
11/30/16 - Installation of MADlib 1.9.1 on GPDB 4.3.11
The procedure is exactly the same as described below for installation of MADlib on GPDB 4.3.10.
10/19/16 - Installation of MADlib 1.9.1 on GPDB 4.3.10
This is an important note for installation of MADlib on GPDB 4.3.10. It does not apply to any other releases.
1) Fix madpack install utility
* issue: After gppkg installation of MADlib, you must run the script
fix_madpack.sh BEFORE running the madpack utility (see below). The script is downloadable from the
Pivotal Network.
2) install checks
* issue: some failures may happen on MADlib install checks, however the MADlib install actually completed OK.
This is a poor customer experience that will be fixed in the next release. On the positive side, once the installation is done, MADlib should work OK.
------------------------------
More on fixing madpack from #1 above:
After gppkg installation of MADlib, you must run the script
fix_madpack.sh BEFORE running the madpack utility.
The syntax for fix_madpack.sh is below.
This can be somewhat confusing because after gppkg
installation, you will get a message on the console
that says:
“Please run the following command to deploy MADlib
usage: madpack install [-s schema_name] -p hawq -c user@host:port/database
etc...”
So the correct order of operations is:
1. gppkg install of MADlib
2. run fix_madpack.sh
3. run madpack utility
*****************************************************
COMMAND NAME: fix_madpack.sh
*****************************************************
Script to fix a MADlib installation issue on GPDB 4.3.10.
This script patches a line in madpack.py, the MADlib installation
script. A backup of the original file is created in the same folder as
madpack.py called 'madpack.py.orig'. The script is downloadable from the
Pivotal Network
*****************************************************
SYNOPSIS
*****************************************************
fix_madpack.sh [--prefix
fix_madpack.sh -h
*****************************************************
PREREQUISITES
*****************************************************
The following tasks should be performed prior to executing this script:
* Set $GPHOME to the correct GPDB installation directory containing MADlib
OR
* Set MADlib installation path using the --prefix option
*****************************************************
OPTIONS
*****************************************************
--prefix
Optional. Expected MADlib installation path. If not set, the default value
${GPHOME}/madlib is used.
-h | -? | --help
Displays the online help.
*****************************************************
EXAMPLE
*****************************************************
/home/gpadmin/madlib/fix_madpack.sh --prefix /usr/local/gpdb/madlib