azkaban-aplcache

project LRU cache part 2 (#1865) The problem this PR targets …

7/25/2018 8:25:28 PM

to solve is detailed in #1803

Drawback of previous design is detailed in #1841

New design:

Create a file in each project directory and write the size of the project to the file when the project is created. The project files are not supposed to change after creation. Touch this file each time the project is used. This way, we can have a more efficient LRU algorithm based on last access time, not creation time.

Maintain the total size of the project cache in memory to avoid the overhead of re-calculating it. The size shouldn't change too often. This way we can afford to run the check more frequently. Project dir size check and corresponding deletion will be performed when a new project is downloaded.
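The design above can be illustrated with a minimal sketch. This is not the code from the PR: the class name, the size-file name `___size___`, and the explicit timestamp passed to `touch` are all assumptions made for illustration, and the deletion step only handles a directory that contains nothing but the size file.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of a size-file-based LRU project cache; all names are illustrative. */
final class ProjectCacheSketch {
  // Assumed marker file name; the commit message does not name the real file.
  static final String SIZE_FILE = "___size___";

  private final long maxBytes;
  private long totalBytes = 0; // kept in memory so the total is not recalculated on each check
  private final List<Path> projects = new ArrayList<>();

  ProjectCacheSketch(long maxBytes) {
    this.maxBytes = maxBytes;
  }

  /** Called when a project is downloaded: record its size once, then run the size check. */
  void addProject(Path dir, long sizeBytes) {
    try {
      Files.createDirectories(dir);
      Files.write(dir.resolve(SIZE_FILE),
          Long.toString(sizeBytes).getBytes(StandardCharsets.UTF_8));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    projects.add(dir);
    totalBytes += sizeBytes;
    evictLeastRecentlyUsed(dir);
  }

  /** Called each time a project is used: the size file's mtime acts as last access time. */
  void touch(Path dir, long epochMillis) {
    try {
      Files.setLastModifiedTime(dir.resolve(SIZE_FILE), FileTime.fromMillis(epochMillis));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  long totalBytes() {
    return totalBytes;
  }

  private static long lastUsed(Path dir) {
    try {
      return Files.getLastModifiedTime(dir.resolve(SIZE_FILE)).toMillis();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Deletes the least recently used projects (never the new one) until under the limit. */
  private void evictLeastRecentlyUsed(Path justAdded) {
    List<Path> candidates = new ArrayList<>(projects);
    candidates.remove(justAdded);
    candidates.sort(Comparator.comparingLong(ProjectCacheSketch::lastUsed));
    for (Path dir : candidates) {
      if (totalBytes <= maxBytes) {
        break;
      }
      try {
        Path sizeFile = dir.resolve(SIZE_FILE);
        long size = Long.parseLong(
            new String(Files.readAllBytes(sizeFile), StandardCharsets.UTF_8).trim());
        Files.delete(sizeFile);
        Files.delete(dir); // a real project directory would need a recursive delete
        projects.remove(dir);
        totalBytes -= size;
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    }
  }
}
```

Reading the size back from the file at eviction time means the only per-project state the sketch needs on disk is the file itself, while the running total stays in memory.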

This PR implements part 2.

Next step is to shorten execution dir retention period to really free up space, given there's always a hard link from execution to project directory.

Cheng Ren

Commit: 0408919

Tree: 58d0026

Parents: c6509ee

Include error stacktrace in updater thread logging (#1867) It …

7/25/2018 7:05:58 PM

was calling log.error(Object message), so we didn't get a stack trace at all. For example this was logged, which is not too helpful: ERROR [ExecutorManager] [Azkaban] java.lang.NullPointerException (the value of e.toString(), as you can see).
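The difference between the two overloads can be shown with a small self-contained sketch. The class below is a hypothetical stand-in that only mimics the behaviour of a log4j-style `error(Object message)` versus `error(Object message, Throwable t)`; it is not the Azkaban code, and the message text "Updater thread failed" is made up for illustration.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

/** Hypothetical stand-in for a log4j-style logger, used only to contrast the two overloads. */
final class StackTraceLoggingSketch {

  /** Mimics error(Object message): only message.toString() is rendered. */
  static String error(Object message) {
    return "ERROR " + message;
  }

  /** Mimics error(Object message, Throwable t): the stack trace is rendered as well. */
  static String error(Object message, Throwable t) {
    StringWriter trace = new StringWriter();
    t.printStackTrace(new PrintWriter(trace, true));
    return "ERROR " + message + System.lineSeparator() + trace;
  }

  public static void main(String[] args) {
    Throwable e = new NullPointerException();
    // Passing the exception as the message loses the stack trace.
    System.out.println(error(e));
    // Passing it as the Throwable argument keeps the stack trace.
    System.out.println(error("Updater thread failed", e));
  }
}
```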

Juho Autio

Commit: c6509ee

Tree: 622487f

Parents: d43c783

Event trigger PR (#1858) This PR is to demonstrate how we can …

7/25/2018 12:29:26 AM

achieve a new feature to trigger workflows by events.
A trigger instance waits, up to the maximum waiting time, for all of its dependency instances to become available before kicking off the flow. Users specify each dependency as a <Topic, Event> pair. In the Trigger.flow example below, the scheduled flow depends on two events, matched by the regular expressions ".*" and "hadoop?.*". For now the only supported dependency type matches a Kafka event against a regex; however, the feature can be extended by implementing an interface.

# Flow Trigger Example
triggerDependencies:
  - name: dep1 # an unique name to identify the dependency
    type: kafka
    params:
      match: .*
      topic: AzEvent_Topic4
  - name: dep2 # an unique name to identify the dependency
    type: kafka
    params:
      match: hadoop?.*
      topic: AzEvent_Topic4
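As a rough illustration of how such dependencies could be evaluated, here is a minimal sketch. The class and method names are hypothetical, the maximum waiting time is left out, and it assumes the regex must match the whole event payload; the PR itself may match differently.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

/** Hypothetical sketch of a trigger that fires once every dependency has seen a matching event. */
final class EventTriggerSketch {

  /** One entry of triggerDependencies: a name, a topic, and a regex rule. */
  static final class Dependency {
    final String name;
    final String topic;
    final Pattern match;

    Dependency(String name, String topic, String regex) {
      this.name = name;
      this.topic = topic;
      this.match = Pattern.compile(regex);
    }

    /** Assumption: the rule must match the entire payload, not just a substring. */
    boolean accepts(String eventTopic, String payload) {
      return topic.equals(eventTopic) && match.matcher(payload).matches();
    }
  }

  private final List<Dependency> dependencies;
  private final Set<String> satisfied = new HashSet<>();

  EventTriggerSketch(Dependency... dependencies) {
    this.dependencies = Arrays.asList(dependencies);
  }

  /** Records an incoming event; returns true once all dependencies have been matched. */
  boolean onEvent(String topic, String payload) {
    for (Dependency d : dependencies) {
      if (d.accepts(topic, payload)) {
        satisfied.add(d.name);
      }
    }
    return satisfied.size() == dependencies.size();
  }
}
```

Note that `hadoop?.*` makes only the final `p` optional, so under the full-match assumption it accepts any payload that starts with `hadoo`.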

Chiawei Chang

Commit: d43c783

Tree: ee5a0c4

Parents: 79a6a1f

Fix NPE in FetchActiveFlowDao (#1862) Ignore active executions …

7/21/2018 9:34:31 PM

with flow_data=null.

For error details, see #1833 (comment).

Cleaning these up manually would require too many changes, because the ExecutableFlow object is used in many places.

Note that it's still possible to end up with rows like this in the DB if the application fails between creating an execution id in execution_flows and uploading the data. See here:

azkaban/azkaban-common/src/main/java/azkaban/executor/ExecutionFlowDao.java

Lines 71 to 74 in 2be81eb

    final long id = this.dbOperator.transaction(insertAndGetLastID);
    logger.info("Flow given " + flow.getFlowId() + " given id " + id);
    flow.setExecutionId((int) id);
    updateExecutableFlow(flow);

I'm not sure if doing both inside a transaction is even possible, so I'd rather just avoid the NPE and log a warning about this corner case, which should be rare.

This weakness already existed before my change #1833. Before that change, however, it could only cause an NPE if the same executor id was still in use after the flow data failed to update (that is, when using a fixed executor id, or when starting an executor after a sudden crash in which the executor didn't get to remove itself from the DB as part of the shutdown hook).
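The "skip and warn" approach can be sketched as follows. This is only an illustration: the `Row` type, the method name, and the warning text are hypothetical and do not come from FetchActiveFlowDao.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of ignoring active executions whose flow_data column is null. */
final class NullFlowDataSketch {

  /** Simplified stand-in for one row of the active executions query. */
  static final class Row {
    final int execId;
    final byte[] flowData; // null if the app failed before uploading the data

    Row(int execId, byte[] flowData) {
      this.execId = execId;
      this.flowData = flowData;
    }
  }

  /** Returns the ids of usable executions; rows with null flow data are skipped with a warning. */
  static List<Integer> loadActive(List<Row> rows, List<String> warnings) {
    List<Integer> active = new ArrayList<>();
    for (Row row : rows) {
      if (row.flowData == null) {
        // Avoid the NPE: record the corner case instead of deserializing null data.
        warnings.add("Execution id " + row.execId + " has flow_data=null; skipping.");
        continue;
      }
      active.add(row.execId);
    }
    return active;
  }
}
```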

Juho Autio

Commit: 79a6a1f

Tree: a0bb4e1

Parents: 7b264b0

fix README (#1863) * Move the requirements.txt to the root …

7/19/2018 9:01:29 PM

directory to consolidate the duplicate one.

* improved README documents to welcome more people to contribute to documentation

Liang Tang

Commit: 7b264b0

Tree: e82bc71

Parents: 23a5fe7

Set up new Documentation toolset and publish it to readTheDocs …

7/19/2018 2:14:40 PM

3.50.2

(#1861)

This PR proposes a new toolset for Azkaban documentation. We use Sphinx and ReStructuredText to write the docs, and publish them to readTheDocs. A README is included to show users how to develop the documentation.

The Getting Started section is rewritten to make sure it is up to date.

Liang Tang

Commit: 23a5fe7

Tree: 54380dd

Parents: 2a4c4b4

add default configurations to help local set-up (#1860) Check-in …

7/19/2018 2:14:15 PM

Default configurations to set up a multi-executor instance locally, so that users are able to follow the new docs to set up one easily.

Liang Tang

Commit: 2a4c4b4

Tree: 3fc32d2

Parents: 02f8351

Condition on job status - Step 1 (#1854) * Condition on job …

7/18/2018 5:43:57 PM

status - step 1

* Add more comments. Change pattern to case insensitive.

Jamie Sun

Commit: 02f8351

Tree: ddd18f0

Parents: 79fd83d

project LRU cache part 1 (#1848) The problem this PR targets …

7/17/2018 3:03:52 PM

to solve is detailed in #1803

Drawback of previous design is detailed in #1841

New design:

Create a file in each project directory and write the size of the project to the file when the project is created. The project files are not supposed to change after creation. Touch this file each time the project is used. This way, we can have a more efficient LRU algorithm based on last access time, not creation time.
Maintain the total size of the project cache in memory to avoid the overhead of re-calculating it. The size shouldn't change too often. This way we can afford to run the check more frequently.
This PR implements part 1.

Cheng Ren

Commit: 79fd83d

Tree: 33cdf21

Parents: e85075c

Finalize running flows without matching executor (#1833) A …

7/16/2018 11:57:47 PM

more complete cleanup: don't leave executions whose executor doesn't exist any more in some interim state like RUNNING - finalize them as FAILED.

To explain the problem scenario, it goes like this:

* If an executor is stopped gracefully (so that it has time to finish at least some shutdown hooks), it removes itself from the executors table.
* However, it's still possible that not all executions on that executor were properly finalized, i.e., that the status of each interrupted execution was set to one of FAILED/KILLED/SUCCESS.
* When an executor is started, it adds itself to the executors table with an executor id that is the next running number.

As a result, there can be executions with status RUNNING (for example) that would never be marked as FAILED, because the executor id of those executions is no longer found in the executors table. This PR fixes that: the status of such executions is finalized.

This is only a concern in multi-executor mode. To be noted, it would be good to support only multi-exec mode in the future. See also description of #1831.
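The cleanup described above can be sketched roughly like this. The types and names are hypothetical, the status enum is simplified to the values mentioned in this message, and only RUNNING is treated as an interim state here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of finalizing executions whose executor no longer exists. */
final class OrphanedExecutionSketch {

  /** Simplified status set; the real set of statuses is larger. */
  enum Status { RUNNING, FAILED, KILLED, SUCCESS }

  static final class Execution {
    final int execId;
    final int executorId;
    Status status;

    Execution(int execId, int executorId, Status status) {
      this.execId = execId;
      this.executorId = executorId;
      this.status = status;
    }
  }

  /**
   * Marks every unfinished execution as FAILED if its executor id is missing from the
   * executors table, and returns the ids of the executions that were finalized.
   */
  static List<Integer> finalizeOrphans(List<Execution> executions, Set<Integer> liveExecutorIds) {
    List<Integer> finalized = new ArrayList<>();
    for (Execution execution : executions) {
      boolean unfinished = execution.status == Status.RUNNING;
      if (unfinished && !liveExecutorIds.contains(execution.executorId)) {
        execution.status = Status.FAILED;
        finalized.add(execution.execId);
      }
    }
    return finalized;
  }
}
```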

Juho Autio

Commit: e85075c

Tree: 772c7d8

Parents: 66307c0