azkaban-developers

Revert "New status: KILLING (#1172)" (#1300) This reverts …

8/1/2017 3:14:21 PM

3.32.0

commit d1b836a.

Conflicts:
azkaban-common/src/main/java/azkaban/executor/Status.java
azkaban-exec-server/src/main/java/azkaban/execapp/JobRunner.java

From @jamiesjc: When we kill the flow immediately after it starts in our integration test, it cannot be killed because of a race condition. Users will see on the execution page that the job is in KILLING status, but it never actually gets killed, and they cannot click the kill button again while the job is in KILLING status.
At the time of the kill, the process might not have started yet, or the jobRunner might not yet have been added to the activeJobRunners.
More details are in the PR description: #1289. We are still working on the fix.

The intention is to reintroduce this commit once the underlying bug is fixed.

Charlie Summers

Commit: 92aff73

Tree: a9de4a4

Parents: cb12948

delete retention logs every hour (#1299) * delete retention …

7/31/2017 6:38:27 PM

logs every hour

We observed long-running deletion transactions in our production Azkaban cluster. In MySQL, if a delete transaction is large (too much data is deleted at once), it occupies too much undo space (buffer), which affects other transactions and other databases on the same host. So we changed the deletion from once a day to once an hour.
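A minimal sketch of the hourly schedule described above, not the code from this commit. The class name `HourlyLogCleaner`, the `cutoff` helper, and the `deleteOldLogs` callback (which stands in for the actual database delete) are all hypothetical names chosen for illustration.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: runs a cleanup callback once an hour so that each
// delete transaction only covers about one hour of newly expired rows.
public class HourlyLogCleaner {

  // Rows older than (now - retention) are eligible for deletion.
  static Instant cutoff(Instant now, Duration retention) {
    return now.minus(retention);
  }

  // Starts the hourly schedule. The first run happens immediately.
  static ScheduledExecutorService start(Runnable deleteOldLogs) {
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleAtFixedRate(() -> {
      try {
        deleteOldLogs.run();
      } catch (RuntimeException e) {
        // An uncaught exception would cancel all later runs, so report and continue.
        System.err.println("Log cleanup failed: " + e);
      }
    }, 0, 1, TimeUnit.HOURS);
    return scheduler;
  }
}
```

The caller is expected to call `shutdownNow()` on the returned scheduler when the server stops, since the scheduler thread is not a daemon thread.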

Liang Tang

Commit: cb12948

Tree: 434d0dd

Parents: c0516e4

Delete executions directory on executor shutdown to reclaim …

7/28/2017 9:46:55 PM

disk space (#1295)

The execution directory can grow relatively large, which could lead to problems if the executor is shut down and stale executions remain on the executor, occupying disk space.

This change deletes the execution directory on executor shutdown. The directory is created on executor initialization, so it does not need to be preserved for the executor to be restarted later.

Note that the execution directory will not be deleted until AFTER all flows are done if the executor is brought down via shutdown() - so it will not delete partially-written logs before they are uploaded to the database.
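The ordering described above (wait for running flows, then delete the directory) could be sketched as follows. This is an illustration under assumptions, not the code from this commit: `ExecutionDirCleanup`, `shutdownAndClean`, and the `flowPool` parameter are hypothetical names, and a plain `ExecutorService` stands in for whatever runs the flows.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

public class ExecutionDirCleanup {

  // Deletes a directory tree bottom-up: files first, then each directory.
  static void deleteRecursively(Path root) throws IOException {
    if (!Files.exists(root)) {
      return;
    }
    Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
      @Override
      public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
        Files.delete(file);
        return FileVisitResult.CONTINUE;
      }

      @Override
      public FileVisitResult postVisitDirectory(Path dir, IOException exc) throws IOException {
        if (exc != null) {
          throw exc;
        }
        Files.delete(dir);
        return FileVisitResult.CONTINUE;
      }
    });
  }

  // Stops accepting new work, waits for running flows to finish, and only
  // then removes the directory. Returns false (and deletes nothing) if the
  // flows did not finish within the timeout.
  static boolean shutdownAndClean(ExecutorService flowPool, Path executionsDir,
      long timeoutSeconds) throws IOException, InterruptedException {
    flowPool.shutdown();
    if (!flowPool.awaitTermination(timeoutSeconds, TimeUnit.SECONDS)) {
      return false;
    }
    deleteRecursively(executionsDir);
    return true;
  }
}
```

Skipping the deletion on timeout is a design choice of this sketch: it prefers leaving stale files behind over removing logs that a still-running flow has not finished writing.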

Charlie Summers

Commit: c0516e4

Tree: fe0b0ea

Parents: 3017d80

Running chmod g+s on executions directory to enable proper …

7/28/2017 1:38:48 PM

execution cleanup (#1292)

Setting the setgid bit on a directory causes all items created within that directory to inherit the directory's group. So all items users create in their /execution/<exec_id> directory will automatically belong to the azkaban group. This allows the azkaban cleanup thread to properly remove user-generated files and directories.

The setgid bit is system-specific, so there is no Java standard library API for setting it. The solution I proposed spawns a subprocess that runs the chmod command. I don't think this is very clean, but I haven't been able to find a better way. Does anybody have any ideas?
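For illustration, spawning `chmod g+s` from Java could look roughly like the sketch below. It is not the code from this commit: the class name `SetgidDirectory` and the method `setGroupIdBit` are hypothetical, and the sketch assumes a POSIX system with `chmod` on the PATH.

```java
import java.io.IOException;
import java.nio.file.Path;

public class SetgidDirectory {

  // Runs `chmod g+s <dir>` in a subprocess and returns its exit code
  // (0 means chmod reported success). Assumes `chmod` is on the PATH.
  static int setGroupIdBit(Path dir) throws IOException, InterruptedException {
    Process process = new ProcessBuilder("chmod", "g+s", dir.toAbsolutePath().toString())
        .inheritIO() // let chmod's error messages reach the parent's stderr
        .start();
    return process.waitFor();
  }
}
```

Passing the command as separate arguments to `ProcessBuilder` avoids shell quoting problems with unusual directory names. A caller would typically treat a non-zero exit code as a failure and log it.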

Another option would be to build this executionDirectory as part of our build process, but that isn't how we do things right now (tested by running deploy on holdem4 jenkins job without any of the other steps and verifying no /executions directory is created until system start).

Note that this change is not covered by tests, because it is difficult to deal with filesystems and subprocesses at the same time within a testing environment. If disk space usage increases in the future, it is possible that this change has regressed. To confirm that this change works in production on particularly large clusters (where problems have been seen), I'll be keeping an eye on it when it's released.

Charlie Summers

Commit: 3017d80

Tree: e40a431

Parents: 9d079ec

remove unused member variables in Session and User class (#1294) this …

7/27/2017 10:46:17 PM

would reduce the memory overhead per session to some extent, so that we can further increase the size of the session cache.

Cheng Ren

Commit: 9d079ec

Tree: bca6672

Parents: 0f88172

Add metrics for sending email success/failure. (#1287) * …

7/27/2017 8:26:53 PM

Add metrics for sending email success/failure.

* Guicify Emailer class and refactor some test cases.

jamiesjc

Commit: 0f88172

Tree: 2583ac1

Parents: 06d4d4d

Fix value out of range exception when storing CANCELLED status …

7/27/2017 6:49:31 PM

in DB. (#1288)

jamiesjc

Commit: 06d4d4d

Tree: 19ba0c5

Parents: 5391230

move singleton to classes' annotations (#1286) * move singleton …

7/25/2017 6:40:06 PM

to classes' annotations

This patch refactors the Guice usage, mainly moving all singleton
bindings to the respective classes via singleton annotations. The corresponding
tests are added as well. More context is in #1285.

Liang Tang

Commit: 5391230

Tree: e152e80

Parents: 4bd7523

New SLA action type for killing a job (#1092) New SLA action …

7/25/2017 12:31:14 AM

type for killing a job
The action kills a job and retries it based on that job's retry configuration.
Previously, only killing a flow was allowed when an SLA was missed, even if the SLA was set at the job level.
The new action kicks in when a user sets an SLA rule on a job with the kill action, and it is enforced when the SLA is missed. There is no UI change.

Tested manually with the following flows:
jobA(retry num: 2)->jobB(retry num: 2)
SLA rule: if job A doesn't succeed in 1 min, kill the job
SLA rule: if job B doesn't succeed in 1 min, kill the job

jobA->jobB, jobA->jobC, jobB->jobD, B retry number is set to 2.
SLA rule: if job B doesn't succeed in 1 min, kill it.

Cheng Ren

Commit: 4bd7523

Tree: b15acdf

Parents: 8c67c59

Add metrics to indicate flow dispatch failure/success count …

7/24/2017 2:22:05 PM

(#1279)

jamiesjc

Commit: 8c67c59

Tree: f376e42

Parents: 65ff397