azkaban-aplcache
docs/howTo.rst 74(+74 -0)
diff --git a/docs/howTo.rst b/docs/howTo.rst
new file mode 100644
index 0000000..c6e810b
--- /dev/null
+++ b/docs/howTo.rst
@@ -0,0 +1,74 @@
+.. _how-to:
+
+How Tos
+=======
+
+Force execution to an executor
+------------------------------
+
+Only users with admin privileges can use this override. In flow params:
+set ``"useExecutor" = EXECUTOR_ID``.
+
+Setting flow priority in multiple executor mode
+-----------------------------------------------
+
+Only users with admin privileges can use this property. In flow params:
+set ``"flowPriority" = PRIORITY``. Higher numbers get executed first.
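+
+Flow parameters such as these can also be passed when launching a flow
+through the ajax API's ``executeFlow`` action, via ``flowOverride``
+parameters. A sketch with curl (assumes a valid ``session.id`` from the
+login API; project and flow names are placeholders):
+
+.. code-block:: bash
+
+   curl -k --get --data "session.id=SESSION_ID" \
+        --data "ajax=executeFlow" \
+        --data "project=my-project" \
+        --data "flow=my-flow" \
+        --data "flowOverride[flowPriority]=5" \
+        "https://WEBSERVER_URL/executor"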
+
+Enabling and Disabling Queue in multiple executor mode
+------------------------------------------------------
+
+Only users with admin privileges can use this action. Use curl or
+simply visit the following URLs:
+
+- Enable: ``WEBSERVER_URL/executor?ajax=enableQueueProcessor``
+- Disable: ``WEBSERVER_URL/executor?ajax=disableQueueProcessor``
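+
+For example, with curl (assumes an authenticated ``session.id``
+obtained from the login API):
+
+.. code-block:: bash
+
+   curl -k --get --data "session.id=SESSION_ID" \
+        --data "ajax=disableQueueProcessor" \
+        "https://WEBSERVER_URL/executor"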
+
+Reloading executors in multiple executor mode
+---------------------------------------------
+
+Only users with admin privileges can use this action. This action needs
+at least one active executor to succeed. Use curl or simply visit the
+following URL: ``WEBSERVER_URL/executor?ajax=reloadExecutors``
+
+Logging job logs to a Kafka cluster
+-----------------------------------
+
+Azkaban supports sending job logs to a log ingestion cluster (such as
+ELK) via a Kafka appender. To enable this in Azkaban, you will need to
+set two exec server properties (shown here with sample values):
+
+.. code-block:: properties
+
+ azkaban.server.logging.kafka.brokerList=localhost:9092
+ azkaban.server.logging.kafka.topic=azkaban-logging
+
+These configure where Azkaban can find your Kafka cluster and which
+topic to put the logs under. If these parameters are missing, Azkaban
+will refuse to create a Kafka appender when one is requested.
+
+In order to configure a job to send its logs to Kafka, the following job
+property needs to be set to true:
+
+.. code-block:: properties
+
+ azkaban.job.logging.kafka.enable=true
+
+Jobs with this setting enabled will broadcast their log messages in
+JSON form to the Kafka cluster. Each message has the following
+structure:
+
+.. code-block:: json
+
+ {
+ "projectname": "Project name",
+ "level": "INFO or ERROR",
+ "submituser": "Someone",
+ "projectversion": "Project version",
+ "category": "Class name",
+ "message": "Some log message",
+ "logsource": "userJob",
+ "flowid": "ID of flow",
+ "execid": "ID of execution"
+ }
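+
+Downstream consumers can treat each record as plain JSON. A minimal
+consumer-side sketch in Python (the field values below are
+illustrative, not real Azkaban output):
+
+.. code-block:: python
+
+   import json
+
+   REQUIRED_FIELDS = {
+       "projectname", "level", "submituser", "projectversion",
+       "category", "message", "logsource", "flowid", "execid",
+   }
+
+   def parse_azkaban_log(record):
+       """Parse one log record and check the expected fields exist."""
+       msg = json.loads(record)
+       missing = REQUIRED_FIELDS - msg.keys()
+       if missing:
+           raise ValueError("missing fields: %s" % sorted(missing))
+       return msg
+
+   sample = json.dumps({
+       "projectname": "demo-project", "level": "INFO",
+       "submituser": "someone", "projectversion": "1",
+       "category": "azkaban.execapp.JobRunner", "message": "job started",
+       "logsource": "userJob", "flowid": "demo-flow", "execid": "42",
+   })
+   parsed = parse_azkaban_log(sample)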
docs/index.rst 2(+2 -0)
diff --git a/docs/index.rst b/docs/index.rst
index ffac5aa..7086361 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -37,6 +37,8 @@ Features
useAzkaban
eventTrigger
ajaxApi
+ howTo
+ plugins
docs/plugins.rst 411(+411 -0)
diff --git a/docs/plugins.rst b/docs/plugins.rst
new file mode 100644
index 0000000..482f81c
--- /dev/null
+++ b/docs/plugins.rst
@@ -0,0 +1,411 @@
+Plugins
+========
+..
+ TODO:Fix download page
+
+Azkaban is designed to be modular. We are able to plug in code to add
+viewer pages or execute jobs in a customizable manner. These pages will
+describe the azkaban-plugins that can be downloaded from `the download
+page <{{ site.home }}/downloads.html>`__ and how to extend
+Azkaban by creating your own plugins or extending an existing one.
+
+.. _hadoopsecuritymanager:
+
+HadoopSecurityManager
+---------------------------
+
+Azkaban is most commonly adopted on big data platforms such as Hadoop.
+Azkaban's jobtype plugin system allows flexible support for such
+systems.
+
+Azkaban is able to support all Hadoop versions, including Hadoop
+security features, and various ecosystem components in different
+versions, such as multiple versions of pig and hive on the same
+instance.
+
+A common pattern to achieve this is to use the
+``HadoopSecurityManager`` class, which handles talking to a Hadoop
+cluster and takes care of Hadoop security, in a secure way.
+
+Hadoop Security with Kerberos, Hadoop Tokens
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When Hadoop is used in an enterprise production environment, it is
+advisable to turn on its security features to protect your data and
+guard against mistakes.
+
+**Kerberos Authentication**
+
+The most common authentication provided by Apache Hadoop is via
+Kerberos, which requires a KDC to authenticate users and services.
+
+A user can authenticate with the KDC via username/password or with a
+keytab. The KDC issues a tgt to authenticated users. Hadoop services,
+such as the name node and the job tracker, can use this tgt to verify
+that the user is authenticated.
+
+**Hadoop Tokens**
+
+Once a user is authenticated with Hadoop services, Hadoop issues tokens
+to the user so that its internal services won't flood the KDC. For a
+description of tokens, see
+`here <http://hortonworks.com/blog/the-role-of-delegation-tokens-in-apache-hadoop-security/>`__.
+
+**Hadoop SecurityManager**
+
+A human user authenticates with the KDC via the ``kinit`` command. But
+a scheduler such as Azkaban, which runs jobs on behalf of other users,
+needs to acquire tokens that will be used by those users. Specific
+Azkaban job types should handle this with the
+``HadoopSecurityManager`` class.
+
+For instance, when Azkaban loads the pig job type, it will instantiate
+a HadoopSecurityManager that is authenticated with the desired KDC and
+Hadoop cluster. The pig job type conf should specify which tokens are
+needed to talk to different services. At a minimum it needs tokens from
+the name node and the job tracker. When a pig job starts, it will go to
+the HadoopSecurityManager to acquire all those tokens. When the user
+process finishes, the pig job type calls the HadoopSecurityManager
+again to cancel all those tokens.
+
+Settings Common to All Hadoop Clusters
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When a user program wants to talk to a Hadoop cluster, it needs to know
+where the name node and the job tracker are. It also needs to know how
+to authenticate with them. This information is all in the Hadoop config
+files that are normally in ``$HADOOP_HOME/conf``. For this reason, this
+conf directory as well as the hadoop-core jar need to be on the azkaban
+executor server classpath.
+
+If you are using Hive with HCat as its metastore, you also need the
+relevant hive jars and hive conf on the classpath.
+
+**Native Library**
+
+Most likely your Hadoop platform depends on some native library. This
+should be specified in ``java.library.path`` on the azkaban executor
+server.
+
+**temp dir**
+
+Besides those, many tools on Hadoop, such as Pig/Hive/Crunch, write
+files into a temporary directory. By default, they all go to ``/tmp``.
+This could cause operational issues when a lot of jobs run
+concurrently. Because of this, you may want to change this by setting
+``java.io.tmpdir`` to a different directory.
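+
+For example, one way to redirect the temporary directory for all job
+types is via the global jvm args in the jobtype plugin settings (the
+path here is a placeholder):
+
+::
+
+    jobtype.global.jvm.args=-Djava.io.tmpdir=/export/azkaban/tmp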
+
+Settings To Talk to UNSECURE Hadoop Cluster
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If you are just starting out with Hadoop, chances are you don't have
+kerberos authentication for your Hadoop. Depending on whether you want
+to run everything as the azkaban user (or whatever user started the
+azkaban executor server), use the following settings:
+
+- If you started the executor server with user named azkaban, and you
+ want to run all the jobs as azkaban on Hadoop, just set
+ ``azkaban.should.proxy=false`` and ``obtain.binary.token=false``
+- If you started the executor server with user named azkaban, but you
+ want to run Hadoop jobs as their individual users, you need to set
+ ``azkaban.should.proxy=true`` and ``obtain.binary.token=false``
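+
+For example, to run everything as the azkaban user, the
+commonprivate.properties for the job types could contain:
+
+::
+
+    azkaban.should.proxy=false
+    obtain.binary.token=false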
+
+Settings To Talk to SECURE Hadoop Cluster
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+For secure Hadoop clusters, Azkaban needs its own kerberos keytab to
+authenticate with KDC. Azkaban job types should acquire necessary Hadoop
+tokens before user job process starts, and should cancel the tokens
+after user job finishes.
+
+All job type specific settings should go to their respective plugin conf
+files. Some of the common settings can go to commonprivate.properties
+and common.properties.
+
+For instance, Hadoop job types usually require name node tokens and job
+tracker tokens. These can go to commonprivate.properties.
+
+**Azkaban as proxy user**
+
+The following settings are needed for HadoopSecurityManager to
+authenticate with KDC:
+
+::
+
+ proxy.user=YOUR_AZKABAN_KERBEROS_PRINCIPAL
+
+This principal should also be set in core-site.xml in Hadoop conf with
+corresponding permissions.
+
+::
+
+ proxy.keytab.location=KEYTAB_LOCATION
+
+One should verify that the proxy user and keytab work with the
+specified KDC.
+
+**Obtaining tokens for user jobs**
+
+Here is what is common to most Hadoop jobs:
+
+::
+
+ hadoop.security.manager.class=azkaban.security.HadoopSecurityManager_H_1_0
+
+This implementation should work with Hadoop 1.x.
+
+::
+
+ azkaban.should.proxy=true
+ obtain.binary.token=true
+ obtain.namenode.token=true
+ obtain.jobtracker.token=true
+
+Additionally, if your job needs to talk to HCat, for example if you
+have Hive installed with a kerberized HCat, or your pig job needs to
+talk to HCat, you will need to set the following for those Hive job
+types:
+
+::
+
+ obtain.hcat.token=true
+
+This makes the HadoopSecurityManager acquire an HCat token as well.
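+
+Putting the pieces above together, a commonprivate.properties for a
+secure cluster might look like the following (the principal and keytab
+path are placeholders):
+
+::
+
+    hadoop.security.manager.class=azkaban.security.HadoopSecurityManager_H_1_0
+    proxy.user=YOUR_AZKABAN_KERBEROS_PRINCIPAL
+    proxy.keytab.location=/etc/security/keytabs/azkaban.keytab
+    azkaban.should.proxy=true
+    obtain.binary.token=true
+    obtain.namenode.token=true
+    obtain.jobtracker.token=true
+    obtain.hcat.token=true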
+
+Making a New Job Type on Secure Hadoop Cluster
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If you are making a new job type that will talk to a Hadoop cluster,
+you can use the HadoopSecurityManager to take care of security.
+
+For an unsecure Hadoop cluster, nothing special is needed.
+
+For secure Hadoop clusters, there are two approaches included in the
+hadoopsecuritymanager package:
+
+- Give the keytab information to the user job process. The
+  hadoopsecuritymanager static method takes care of logging in from
+  that common keytab and proxying to the user. This is convenient for
+  prototyping, as there will be a real tgt granted to the user job. The
+  downside is that the user could potentially use the keytab to log in
+  and proxy as someone else, which presents a security hole.
+- Obtain Hadoop tokens before the user job process starts. The job
+  wrapper will pick up these binary tokens inside the user job process.
+  The tokens should be explicitly cancelled after the user job
+  finishes.
+
+By pairing a properly configured hadoopsecuritymanager with basic job
+types such as hadoopJava, pig, and hive, one can make these job types
+work with different versions of Hadoop with various security settings.
+
+Included in the azkaban-plugins is the hadoopsecuritymanager for
+Hadoop-1.x versions. It is not compatible with Hadoop-0.20 and prior
+versions as Hadoop UGI is not backwards compatible. However, it should
+not be difficult to implement one that works with them. Going forward,
+Hadoop UGI is mostly backwards compatible and one only needs to
+recompile hadoopsecuritymanager package with newer versions of Hadoop.
+
+.. _hdfs-browser:
+
+Azkaban HDFS Browser
+--------------------
+
+The Azkaban HDFS Browser is a plugin that allows you to view the HDFS
+filesystem and decode several file types. It was originally created at
+LinkedIn to view Avro files, LinkedIn's BinaryJson format, and text
+files. As this plugin matures further, we may add decoding of more
+file types in the future.
+
+.. image:: figures/hdfsbrowser.png
+
+Setup
+~~~~~
+..
+ TODO:Fix download page
+
+Download the HDFS plugin from `the download
+page <{{ site.home }}/downloads.html>`__ and extract it into
+the web server's plugin's directory. This is often
+``azkaban_web_server_dir/plugins/viewer/``.
+
+**Users**
+
+By default, the Azkaban HDFS browser does a do-as to impersonate the
+logged-in user. Oftentimes, data is created and handled by a headless
+account. To view these files, if user proxy is turned on, the user can
+switch to the headless account as long as it is validated by the
+UserManager.
+
+**Settings**
+
+These are properties to configure the HDFS Browser on the
+AzkabanWebServer. They can be set in
+``azkaban_web_server_dir/plugins/viewer/hdfs/conf/plugin.properties``.
+
++-----------------------+-----------------------+-----------------------+
+| Parameter | Description | Default |
++=======================+=======================+=======================+
+| viewer.name | The name of this | HDFS |
+| | viewer plugin | |
++-----------------------+-----------------------+-----------------------+
+| viewer.path | The path to this | hdfs |
+| | viewer plugin inside | |
+| | viewer directory. | |
++-----------------------+-----------------------+-----------------------+
+| viewer.order | The order of this | 1 |
+| | viewer plugin amongst | |
+| | all viewer plugins. | |
++-----------------------+-----------------------+-----------------------+
+| viewer.hidden | Whether this plugin | false |
+| | should show up on the | |
+| | web UI. | |
++-----------------------+-----------------------+-----------------------+
+| viewer.external.class | Extra jars this | extlib/\* |
+| path | viewer plugin should | |
+| | load upon init. | |
++-----------------------+-----------------------+-----------------------+
+| viewer.servlet.class  | The main servlet      |                       |
+| | class for this viewer | |
+| | plugin. Use | |
+| | ``azkaban.viewer.hdfs | |
+| | .HdfsBrowserServlet`` | |
+| | for hdfs browser | |
++-----------------------+-----------------------+-----------------------+
+| hadoop.security.manag | The class that | |
+| er.class | handles talking to | |
+| | hadoop clusters. Use | |
+| | ``azkaban.security.Ha | |
+| | doopSecurityManager_H | |
+| | _1_0`` | |
+| | for hadoop 1.x | |
++-----------------------+-----------------------+-----------------------+
+| azkaban.should.proxy | Whether Azkaban | false |
+| | should proxy as | |
+| | individual user | |
+| | hadoop accounts on a | |
+| | secure cluster, | |
+| | defaults to false | |
++-----------------------+-----------------------+-----------------------+
+| proxy.user | The Azkaban user | |
+| | configured with | |
+| | kerberos and hadoop. | |
+| | Similar to how oozie | |
+| | should be configured, | |
+| | for secure hadoop | |
+| | installations | |
++-----------------------+-----------------------+-----------------------+
+| proxy.keytab.location | The location of the | |
+| | keytab file with | |
+| | which Azkaban can | |
+| | authenticate with | |
+| | Kerberos for the | |
+| | specified proxy.user | |
++-----------------------+-----------------------+-----------------------+
+| allow.group.proxy | Whether to allow | false |
+| | users in the same | |
+| | headless user group | |
+| | to view hdfs | |
+| | filesystem as that | |
+| | headless user | |
++-----------------------+-----------------------+-----------------------+
+
+.. _jobtype-plugins:
+
+JobType Plugins
+---------------
+
+Azkaban Jobtype Plugins Configurations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+These are properties to configure the jobtype plugins that are installed
+with the AzkabanExecutorServer. Note that Azkaban uses the directory
+structure to infer global settings versus individual jobtype specific
+settings. Sub-directory names also determine the job type name for
+running Azkaban instances.
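+
+A typical layout under the executor server's plugins directory, based
+on the convention described above (the directory names and the per-type
+file names are illustrative):
+
+::
+
+    azkaban_exec_server_dir/plugins/jobtypes/
+        commonprivate.properties    <- global, hidden from user code
+        common.properties           <- global, visible to user code
+        pig-0.10.1/
+            plugin.properties
+            private.properties
+        hive/
+            plugin.properties
+            private.properties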
+
+**Introduction**
+
+
+Jobtype plugins determine how individual jobs are actually run, locally
+or on a remote cluster. They give great benefits: one can add or change
+any job type without touching Azkaban core code; one can easily extend
+Azkaban to run on different hadoop versions or distributions; one can
+keep old versions around while adding new versions of the same types.
+However, it is really up to the admin who manages these plugins to make
+sure they are installed and configured correctly.
+
+Upon AzkabanExecutorServer start up, Azkaban will try to load all the
+job type plugins it can find. Azkaban will do very simple tests and
+drop the bad ones. One should always run some test jobs to make sure
+the job types really work as expected.
+
+**Global Properties**
+
+One can pass global settings to all job types, including cluster
+dependent settings that will be used by all job types. These settings
+can also be specified in each job type's own settings as well.
+
+**Private settings**
+
+One can pass global settings that are needed by job types but should not
+be accessible by user code in ``commonprivate.properties``. For example,
+the following settings are often needed for a hadoop cluster:
+
++-----------------------------------+-----------------------------------+
+| Parameter | Description |
++===================================+===================================+
+| hadoop.security.manager.class | The hadoopsecuritymanager that |
+| | handles talking to a hadoop |
+|                                   | cluster. Use                      |
+| | ``azkaban.security.HadoopSecurity |
+| | Manager_H_1_0`` |
+| | for 1.x versions |
++-----------------------------------+-----------------------------------+
+| azkaban.should.proxy | Whether Azkaban should proxy as |
+| | individual user hadoop accounts, |
+| | or run as the Azkaban user |
+| | itself, defaults to ``true`` |
++-----------------------------------+-----------------------------------+
+| proxy.user | The Azkaban user configured with |
+| | kerberos and hadoop. Similar to |
+| | how oozie should be configured, |
+| | for secure hadoop installations |
++-----------------------------------+-----------------------------------+
+| proxy.keytab.location | The location of the keytab file |
+| | with which Azkaban can |
+| | authenticate with Kerberos for |
+| | the specified proxy.user |
++-----------------------------------+-----------------------------------+
+| jobtype.global.classpath | The jars or xml resources every |
+| | job type should have on their |
+| | classpath. (e.g. |
+| | ``${hadoop.home}/hadoop-core-1.0. |
+| | 4.jar,${hadoop.home}/conf``) |
++-----------------------------------+-----------------------------------+
+| jobtype.global.jvm.args | The jvm args that every job type |
+|                                   | should pass to the JVM.           |
++-----------------------------------+-----------------------------------+
+| hadoop.home | The ``$HADOOP_HOME`` setting. |
++-----------------------------------+-----------------------------------+
+
+**Public settings**
+
+One can pass global settings that are needed by job types and can be
+visible by user code, in ``common.properties``. For example,
+``hadoop.home`` should normally be passed along to user programs.
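+
+For example, a minimal common.properties might expose just the home
+directories that user code needs (the paths are placeholders):
+
+::
+
+    hadoop.home=/opt/hadoop
+    pig.home=/opt/pig
+    hive.home=/opt/hive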
+
+**Settings for individual job types**
+
+In most cases, there are no extra settings needed for job types to
+work, other than variables like ``hadoop.home``, ``pig.home``,
+``hive.home``, etc. However, this is also where most of the
+customization comes from. For example, one can configure two pig job
+types with the same jar resources but with different hadoop
+configurations, thereby submitting pig jobs to different clusters. One
+can also configure a pig job type with pre-registered jars and
+namespace imports for specific organizations.
+Also to be noted: in the list of common job type plugins, we have
+included different pig versions. The admin needs to make a soft link to
+one of them, such as
+
+::
+
+ $ ln -s pig-0.10.1 pig
+
+so that the users can use a default "pig" type.