azkaban-aplcache
docs/howTo.rst 74(+74 -0)
diff --git a/docs/howTo.rst b/docs/howTo.rst
new file mode 100644
index 0000000..c6e810b
--- /dev/null
+++ b/docs/howTo.rst
@@ -0,0 +1,74 @@
+.. _how-to:
+
+How Tos
+=======
+
+Force execution to an executor
+------------------------------
+
+Only users with admin privileges can use this override. In flow params:
+set ``"useExecutor" = EXECUTOR_ID``.
+
+Setting flow priority in multiple executor mode
+-----------------------------------------------
+
+Only users with admin privileges can use this property. In flow params:
+set ``"flowPriority" = PRIORITY``. Higher numbers get executed first.
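+
+Flow parameters such as these can also be passed when launching a flow
+through the ajax API's ``executeFlow`` action, via ``flowOverride``
+parameters. A sketch with curl (assumes a valid ``session.id`` from the
+login API; project and flow names are placeholders):
+
+.. code-block:: bash
+
+   curl -k --get --data "session.id=SESSION_ID" \
+        --data "ajax=executeFlow" \
+        --data "project=my-project" \
+        --data "flow=my-flow" \
+        --data "flowOverride[flowPriority]=5" \
+        "https://WEBSERVER_URL/executor"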
+
+Enabling and Disabling Queue in multiple executor mode
+------------------------------------------------------
+
+Only users with admin privileges can use this action. Use curl or
+simply visit the following URLs:
+
+- Enable: ``WEBSERVER_URL/executor?ajax=enableQueueProcessor``
+- Disable: ``WEBSERVER_URL/executor?ajax=disableQueueProcessor``
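+
+For example, with curl (assumes an authenticated ``session.id``
+obtained from the login API):
+
+.. code-block:: bash
+
+   curl -k --get --data "session.id=SESSION_ID" \
+        --data "ajax=disableQueueProcessor" \
+        "https://WEBSERVER_URL/executor"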
+
+Reloading executors in multiple executor mode
+---------------------------------------------
+
+Only users with admin privileges can use this action. This action needs
+at least one active executor to succeed. Use curl or simply visit the
+following URL: ``WEBSERVER_URL/executor?ajax=reloadExecutors``
+
+Logging job logs to a Kafka cluster
+-----------------------------------
+
+Azkaban supports sending job logs to a log ingestion cluster (such as
+ELK) via a Kafka appender. To enable this in Azkaban, you will need to
+set two exec server properties (shown here with sample values):
+
+.. code-block:: properties
+
+ azkaban.server.logging.kafka.brokerList=localhost:9092
+ azkaban.server.logging.kafka.topic=azkaban-logging
+
+These configure where Azkaban can find your Kafka cluster and which
+topic to put the logs under. If these parameters are missing, Azkaban
+will refuse to create a Kafka appender when one is requested.
+
+In order to configure a job to send its logs to Kafka, the following job
+property needs to be set to true:
+
+.. code-block:: properties
+
+ azkaban.job.logging.kafka.enable=true
+
+Jobs with this setting enabled will broadcast their log messages in
+JSON form to the Kafka cluster. Each message has the following
+structure:
+
+.. code-block:: json
+
+ {
+ "projectname": "Project name",
+ "level": "INFO or ERROR",
+ "submituser": "Someone",
+ "projectversion": "Project version",
+ "category": "Class name",
+ "message": "Some log message",
+ "logsource": "userJob",
+ "flowid": "ID of flow",
+ "execid": "ID of execution"
+ }
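+
+Downstream consumers can treat each record as plain JSON. A minimal
+consumer-side sketch in Python (the field values below are
+illustrative, not real Azkaban output):
+
+.. code-block:: python
+
+   import json
+
+   REQUIRED_FIELDS = {
+       "projectname", "level", "submituser", "projectversion",
+       "category", "message", "logsource", "flowid", "execid",
+   }
+
+   def parse_azkaban_log(record):
+       """Parse one log record and check the expected fields exist."""
+       msg = json.loads(record)
+       missing = REQUIRED_FIELDS - msg.keys()
+       if missing:
+           raise ValueError("missing fields: %s" % sorted(missing))
+       return msg
+
+   sample = json.dumps({
+       "projectname": "demo-project", "level": "INFO",
+       "submituser": "someone", "projectversion": "1",
+       "category": "azkaban.execapp.JobRunner", "message": "job started",
+       "logsource": "userJob", "flowid": "demo-flow", "execid": "42",
+   })
+   parsed = parse_azkaban_log(sample)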
docs/index.rst 2(+2 -0)
diff --git a/docs/index.rst b/docs/index.rst
index ffac5aa..7086361 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -37,6 +37,8 @@ Features
useAzkaban
eventTrigger
ajaxApi
+ howTo
+ plugins
docs/plugins.rst 411(+411 -0)
diff --git a/docs/plugins.rst b/docs/plugins.rst
new file mode 100644
index 0000000..482f81c
--- /dev/null
+++ b/docs/plugins.rst
@@ -0,0 +1,411 @@
+Plugins
+========
+..
+ TODO:Fix download page
+
+Azkaban is designed to be modular. We are able to plug in code to add
+viewer pages or execute jobs in a customizable manner. These pages will
+describe the azkaban-plugins that can be downloaded from `the download
+page <{{ site.home }}/downloads.html>`__ and how to extend
+Azkaban by creating your own plugins or extending an existing one.
+
+.. _hadoopsecuritymanager:
+
+HadoopSecurityManager
+---------------------------
+
+Azkaban is most commonly adopted on big data platforms such as Hadoop.
+Azkaban's jobtype plugin system allows flexible support for such
+systems.
+
+Azkaban is able to support all Hadoop versions, including Hadoop
+security features, and various ecosystem components in different
+versions, such as multiple versions of pig and hive on the same
+instance.
+
+A common pattern to achieve this is to use the
+``HadoopSecurityManager`` class, which handles talking to a Hadoop
+cluster and takes care of Hadoop security, in a secure way.
+
+Hadoop Security with Kerberos, Hadoop Tokens
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When Hadoop is used in an enterprise production environment, it is
+advisable to turn on its security features to protect your data and
+guard against mistakes.
+
+**Kerberos Authentication**
+
+The most common authentication provided by Apache Hadoop is via
+Kerberos, which requires a KDC to authenticate users and services.
+
+A user can authenticate with the KDC via username/password or with a
+keytab. The KDC issues a tgt to authenticated users. Hadoop services,
+such as the name node and the job tracker, can use this tgt to verify
+that the user is authenticated.
+
+**Hadoop Tokens**
+
+Once a user is authenticated with Hadoop services, Hadoop issues tokens
+to the user so that its internal services won't flood the KDC. For a
+description of tokens, see
+`here <http://hortonworks.com/blog/the-role-of-delegation-tokens-in-apache-hadoop-security/>`__.
+
+**Hadoop SecurityManager**
+
+A human user authenticates with the KDC via the ``kinit`` command. But
+a scheduler such as Azkaban, which runs jobs on behalf of other users,
+needs to acquire tokens that will be used by those users. Specific
+Azkaban job types should handle this with the
+``HadoopSecurityManager`` class.
+
+For instance, when Azkaban loads the pig job type, it will instantiate
+a HadoopSecurityManager that is authenticated with the desired KDC and
+Hadoop cluster. The pig job type conf should specify which tokens are
+needed to talk to different services. At a minimum it needs tokens from
+the name node and the job tracker. When a pig job starts, it will go to
+the HadoopSecurityManager to acquire all those tokens. When the user
+process finishes, the pig job type calls the HadoopSecurityManager
+again to cancel all those tokens.
+
+Settings Common to All Hadoop Clusters
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When a user program wants to talk to a Hadoop cluster, it needs to know
+where the name node and the job tracker are. It also needs to know how
+to authenticate with them. This information is all in the Hadoop config
+files that are normally in ``$HADOOP_HOME/conf``. For this reason, this
+conf directory as well as the hadoop-core jar need to be on the azkaban
+executor server classpath.
+
+If you are using Hive with HCat as its metastore, you also need the
+relevant hive jars and hive conf on the classpath.
+
+**Native Library**
+
+Most likely your Hadoop platform depends on some native library. This
+should be specified in ``java.library.path`` on the azkaban executor
+server.
+
+**temp dir**
+
+Besides those, many tools on Hadoop, such as Pig/Hive/Crunch, write
+files into a temporary directory. By default, they all go to ``/tmp``.
+This could cause operational issues when a lot of jobs run
+concurrently. Because of this, you may want to change this by setting
+``java.io.tmpdir`` to a different directory.
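+
+For example, one way to redirect the temporary directory for all job
+types is via the global jvm args in the jobtype plugin settings (the
+path here is a placeholder):
+
+::
+
+    jobtype.global.jvm.args=-Djava.io.tmpdir=/export/azkaban/tmp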
+
+Settings To Talk to UNSECURE Hadoop Cluster
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If you are just starting out with Hadoop, chances are you don't have
+kerberos authentication for your Hadoop. Depending on whether you want
+to run everything as the azkaban user (or whatever user started the
+azkaban executor server), use the following settings:
+
+- If you started the executor server with user named azkaban, and you
+ want to run all the jobs as azkaban on Hadoop, just set
+ ``azkaban.should.proxy=false`` and ``obtain.binary.token=false``
+- If you started the executor server with user named azkaban, but you
+ want to run Hadoop jobs as their individual users, you need to set
+ ``azkaban.should.proxy=true`` and ``obtain.binary.token=false``
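+
+For example, to run everything as the azkaban user, the
+commonprivate.properties for the job types could contain:
+
+::
+
+    azkaban.should.proxy=false
+    obtain.binary.token=false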
+
+Settings To Talk to SECURE Hadoop Cluster
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+For secure Hadoop clusters, Azkaban needs its own kerberos keytab to
+authenticate with KDC. Azkaban job types should acquire necessary Hadoop
+tokens before user job process starts, and should cancel the tokens
+after user job finishes.
+
+All job type specific settings should go to their respective plugin conf
+files. Some of the common settings can go to commonprivate.properties
+and common.properties.
+
+For instance, Hadoop job types usually require name node tokens and job
+tracker tokens. These can go to commonprivate.properties.
+
+**Azkaban as proxy user**
+
+The following settings are needed for HadoopSecurityManager to
+authenticate with KDC:
+
+::
+
+ proxy.user=YOUR_AZKABAN_KERBEROS_PRINCIPAL
+
+This principal should also be set in core-site.xml in Hadoop conf with
+corresponding permissions.
+
+::
+
+ proxy.keytab.location=KEYTAB_LOCATION
+
+One should verify that the proxy user and keytab work with the
+specified KDC.
+
+**Obtaining tokens for user jobs**
+
+Here is what is common to most Hadoop jobs:
+
+::
+
+ hadoop.security.manager.class=azkaban.security.HadoopSecurityManager_H_1_0
+
+This implementation should work with Hadoop 1.x.
+
+::
+
+ azkaban.should.proxy=true
+ obtain.binary.token=true
+ obtain.namenode.token=true
+ obtain.jobtracker.token=true
+
+Additionally, if your job needs to talk to HCat, for example if you
+have Hive installed with a kerberized HCat, or your pig job needs to
+talk to HCat, you will need to set the following for those Hive job
+types:
+
+::
+
+ obtain.hcat.token=true
+
+This makes the HadoopSecurityManager acquire an HCat token as well.
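+
+Putting the pieces above together, a commonprivate.properties for a
+secure cluster might look like the following (the principal and keytab
+path are placeholders):
+
+::
+
+    hadoop.security.manager.class=azkaban.security.HadoopSecurityManager_H_1_0
+    proxy.user=YOUR_AZKABAN_KERBEROS_PRINCIPAL
+    proxy.keytab.location=/etc/security/keytabs/azkaban.keytab
+    azkaban.should.proxy=true
+    obtain.binary.token=true
+    obtain.namenode.token=true
+    obtain.jobtracker.token=true
+    obtain.hcat.token=true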
+
+Making a New Job Type on Secure Hadoop Cluster
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If you are making a new job type that will talk to a Hadoop cluster,
+you can use the HadoopSecurityManager to take care of security.
+
+For an unsecure Hadoop cluster, nothing special is needed.
+
+For secure Hadoop clusters, there are two approaches included in the
+hadoopsecuritymanager package:
+
+- Give the keytab information to the user job process. The
+  hadoopsecuritymanager static method takes care of logging in from
+  that common keytab and proxying to the user. This is convenient for
+  prototyping, as there will be a real tgt granted to the user job. The
+  downside is that the user could potentially use the keytab to log in
+  and proxy as someone else, which presents a security hole.
+- Obtain Hadoop tokens before the user job process starts. The job
+  wrapper will pick up these binary tokens inside the user job process.
+  The tokens should be explicitly cancelled after the user job
+  finishes.
+
+By pairing a properly configured hadoopsecuritymanager with basic job
+types such as hadoopJava, pig, and hive, one can make these job types
+work with different versions of Hadoop with various security settings.
+
+Included in the azkaban-plugins is the hadoopsecuritymanager for
+Hadoop-1.x versions. It is not compatible with Hadoop-0.20 and prior
+versions as Hadoop UGI is not backwards compatible. However, it should
+not be difficult to implement one that works with them. Going forward,
+Hadoop UGI is mostly backwards compatible and one only needs to
+recompile hadoopsecuritymanager package with newer versions of Hadoop.
+
+.. _hdfs-browser:
+
+Azkaban HDFS Browser
+--------------------
+
+The Azkaban HDFS Browser is a plugin that allows you to view the HDFS
+filesystem and decode several file types. It was originally created at
+LinkedIn to view Avro files, LinkedIn's BinaryJson format, and text
+files. As this plugin matures further, we may add decoding of more
+file types in the future.
+
+.. image:: figures/hdfsbrowser.png
+
+Setup
+~~~~~
+..
+ TODO:Fix download page
+
+Download the HDFS plugin from `the download
+page <{{ site.home }}/downloads.html>`__ and extract it into
+the web server's plugin's directory. This is often
+``azkaban_web_server_dir/plugins/viewer/``.
+
+**Users**
+
+By default, the Azkaban HDFS browser does a do-as to impersonate the
+logged-in user. Oftentimes, data is created and handled by a headless
+account. To view these files, if user proxy is turned on, the user can
+switch to the headless account as long as it is validated by the
+UserManager.
+
+**Settings**
+
+These are properties to configure the HDFS Browser on the
+AzkabanWebServer. They can be set in
+``azkaban_web_server_dir/plugins/viewer/hdfs/conf/plugin.properties``.
+
++-----------------------+-----------------------+-----------------------+
+| Parameter | Description | Default |
++=======================+=======================+=======================+
+| viewer.name | The name of this | HDFS |
+| | viewer plugin | |
++-----------------------+-----------------------+-----------------------+
+| viewer.path | The path to this | hdfs |
+| | viewer plugin inside | |
+| | viewer directory. | |
++-----------------------+-----------------------+-----------------------+
+| viewer.order | The order of this | 1 |
+| | viewer plugin amongst | |
+| | all viewer plugins. | |
++-----------------------+-----------------------+-----------------------+
+| viewer.hidden | Whether this plugin | false |
+| | should show up on the | |
+| | web UI. | |
++-----------------------+-----------------------+-----------------------+
+| viewer.external.class | Extra jars this | extlib/\* |
+| path | viewer plugin should | |
+| | load upon init. | |
++-----------------------+-----------------------+-----------------------+
+| viewer.servlet.class  | The main servlet      |                       |
+| | class for this viewer | |
+| | plugin. Use | |
+| | ``azkaban.viewer.hdfs | |
+| | .HdfsBrowserServlet`` | |
+| | for hdfs browser | |
++-----------------------+-----------------------+-----------------------+
+| hadoop.security.manag | The class that | |
+| er.class | handles talking to | |
+| | hadoop clusters. Use | |
+| | ``azkaban.security.Ha | |
+| | doopSecurityManager_H | |
+| | _1_0`` | |
+| | for hadoop 1.x | |
++-----------------------+-----------------------+-----------------------+
+| azkaban.should.proxy | Whether Azkaban | false |
+| | should proxy as | |
+| | individual user | |
+| | hadoop accounts on a | |
+| | secure cluster, | |
+| | defaults to false | |
++-----------------------+-----------------------+-----------------------+
+| proxy.user | The Azkaban user | |
+| | configured with | |
+| | kerberos and hadoop. | |
+| | Similar to how oozie | |
+| | should be configured, | |
+| | for secure hadoop | |
+| | installations | |
++-----------------------+-----------------------+-----------------------+
+| proxy.keytab.location | The location of the | |
+| | keytab file with | |
+| | which Azkaban can | |
+| | authenticate with | |
+| | Kerberos for the | |
+| | specified proxy.user | |
++-----------------------+-----------------------+-----------------------+
+| allow.group.proxy | Whether to allow | false |
+| | users in the same | |
+| | headless user group | |
+| | to view hdfs | |
+| | filesystem as that | |
+| | headless user | |
++-----------------------+-----------------------+-----------------------+
+
+.. _jobtype-plugins:
+
+JobType Plugins
+---------------
+
+Azkaban Jobtype Plugins Configurations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+These are properties to configure the jobtype plugins that are installed
+with the AzkabanExecutorServer. Note that Azkaban uses the directory
+structure to infer global settings versus individual jobtype specific
+settings. Sub-directory names also determine the job type name for
+running Azkaban instances.
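+
+A typical layout under the executor server's plugins directory, based
+on the convention described above (the directory names and the per-type
+file names are illustrative):
+
+::
+
+    azkaban_exec_server_dir/plugins/jobtypes/
+        commonprivate.properties    <- global, hidden from user code
+        common.properties           <- global, visible to user code
+        pig-0.10.1/
+            plugin.properties
+            private.properties
+        hive/
+            plugin.properties
+            private.properties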
+
+**Introduction**
+
+
+Jobtype plugins determine how individual jobs are actually run, locally
+or on a remote cluster. They give great benefits: one can add or change
+any job type without touching Azkaban core code; one can easily extend
+Azkaban to run on different hadoop versions or distributions; one can
+keep old versions around while adding new versions of the same types.
+However, it is really up to the admin who manages these plugins to make
+sure they are installed and configured correctly.
+
+Upon AzkabanExecutorServer start up, Azkaban will try to load all the
+job type plugins it can find. Azkaban will do very simple tests and
+drop the bad ones. One should always run some test jobs to make sure
+the job types really work as expected.
+
+**Global Properties**
+
+One can pass global settings to all job types, including cluster
+dependent settings that will be used by all job types. These settings
+can also be specified in each job type's own settings as well.
+
+**Private settings**
+
+One can pass global settings that are needed by job types but should not
+be accessible by user code in ``commonprivate.properties``. For example,
+the following settings are often needed for a hadoop cluster:
+
++-----------------------------------+-----------------------------------+
+| Parameter | Description |
++===================================+===================================+
+| hadoop.security.manager.class | The hadoopsecuritymanager that |
+| | handles talking to a hadoop |
+|                                   | cluster. Use                      |
+| | ``azkaban.security.HadoopSecurity |
+| | Manager_H_1_0`` |
+| | for 1.x versions |
++-----------------------------------+-----------------------------------+
+| azkaban.should.proxy | Whether Azkaban should proxy as |
+| | individual user hadoop accounts, |
+| | or run as the Azkaban user |
+| | itself, defaults to ``true`` |
++-----------------------------------+-----------------------------------+
+| proxy.user | The Azkaban user configured with |
+| | kerberos and hadoop. Similar to |
+| | how oozie should be configured, |
+| | for secure hadoop installations |
++-----------------------------------+-----------------------------------+
+| proxy.keytab.location | The location of the keytab file |
+| | with which Azkaban can |
+| | authenticate with Kerberos for |
+| | the specified proxy.user |
++-----------------------------------+-----------------------------------+
+| jobtype.global.classpath | The jars or xml resources every |
+| | job type should have on their |
+| | classpath. (e.g. |
+| | ``${hadoop.home}/hadoop-core-1.0. |
+| | 4.jar,${hadoop.home}/conf``) |
++-----------------------------------+-----------------------------------+
+| jobtype.global.jvm.args | The jvm args that every job type |
+|                                   | should pass to the JVM.           |
++-----------------------------------+-----------------------------------+
+| hadoop.home | The ``$HADOOP_HOME`` setting. |
++-----------------------------------+-----------------------------------+
+
+**Public settings**
+
+One can pass global settings that are needed by job types and can be
+visible by user code, in ``common.properties``. For example,
+``hadoop.home`` should normally be passed along to user programs.
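+
+For example, a minimal common.properties might expose just the home
+directories that user code needs (the paths are placeholders):
+
+::
+
+    hadoop.home=/opt/hadoop
+    pig.home=/opt/pig
+    hive.home=/opt/hive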
+
+**Settings for individual job types**
+
+In most cases, there are no extra settings needed for job types to
+work, other than variables like ``hadoop.home``, ``pig.home``,
+``hive.home``, etc. However, this is also where most of the
+customization comes from. For example, one can configure two pig job
+types with the same jar resources but with different hadoop
+configurations, thereby submitting pig jobs to different clusters. One
+can also configure a pig job type with pre-registered jars and
+namespace imports for specific organizations.
+Also to be noted: in the list of common job type plugins, we have
+included different pig versions. The admin needs to make a soft link to
+one of them, such as
+
+::
+
+ $ ln -s pig-0.10.1 pig
+
+so that the users can use a default "pig" type.