8/1/2017 3:14:21 PM
3.32.0
commit d1b836a.
Conflicts:
azkaban-common/src/main/java/azkaban/executor/Status.java
azkaban-exec-server/src/main/java/azkaban/execapp/JobRunner.java
From @jamiesjc: When we kill a flow immediately after it starts in our integration test, a race condition prevents it from being killed. On the execution page, users see the job in KILLING status, but it never actually gets killed, and they cannot click the kill button again while the job is in KILLING status.
At the time of the kill, the process might not have started yet, or the jobRunner might not yet have been added to activeJobRunners.
You can check more details in the PR description: #1289. We are still working on the fix.
The intention is to reintroduce this commit once the underlying bug is fixed.
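The window described above can be illustrated with a minimal Java sketch. The names KillRaceSketch, Runner, registerNaive and registerChecked are hypothetical and are not Azkaban's code. The re-check in registerChecked is only one common way to close such a window, shown for illustration; it is not the fix from the PR.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class KillRaceSketch {
  // Hypothetical stand-in for a job runner; not Azkaban's JobRunner class.
  static class Runner {
    volatile boolean killed = false;

    void kill() {
      killed = true;
    }
  }

  private final Set<Runner> activeJobRunners = ConcurrentHashMap.newKeySet();
  private volatile boolean flowKilled = false;

  // The kill only reaches runners that are already registered.
  void kill() {
    flowKilled = true;
    for (Runner r : activeJobRunners) {
      r.kill();
    }
  }

  // Naive registration: a runner added after kill() is never killed.
  void registerNaive(Runner r) {
    activeJobRunners.add(r);
  }

  // Assumed mitigation: re-check the kill flag after registering,
  // so a runner that arrives late still observes the kill.
  void registerChecked(Runner r) {
    activeJobRunners.add(r);
    if (flowKilled) {
      r.kill();
    }
  }
}
```

In this sketch a runner registered through registerNaive after kill() stays alive, which mirrors a job stuck in KILLING, while registerChecked kills it on arrival.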
|
7/31/2017 6:38:27 PM
logs every hour
We observed long-running deletion transactions in our production Azkaban cluster. In MySQL, a large delete transaction (one that removes too much data at once) occupies too much undo space (buffer), which affects other transactions and other databases on the same host. So we changed the deletion from once a day to once an hour.
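As a rough sketch of running the cleanup hourly, the following Java snippet schedules a task with a one-hour period. HourlyCleanupSketch, scheduleHourly and the deleteOldLogs task are hypothetical names; the actual scheduling code and delete statement in Azkaban are not shown in this entry.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

public class HourlyCleanupSketch {
  // Runs the given cleanup task immediately and then once per hour.
  // deleteOldLogs is a hypothetical placeholder for the real deletion.
  static ScheduledFuture<?> scheduleHourly(
      ScheduledExecutorService scheduler, Runnable deleteOldLogs) {
    Runnable safe = () -> {
      try {
        deleteOldLogs.run();
      } catch (RuntimeException e) {
        // scheduleAtFixedRate stops rescheduling a task that throws,
        // so failures are reported here and the schedule continues.
        System.err.println("cleanup failed: " + e);
      }
    };
    return scheduler.scheduleAtFixedRate(safe, 0, 1, TimeUnit.HOURS);
  }
}
```

Each hourly run then only has to remove about one hour's worth of expired rows, which keeps the individual delete transaction smaller than a once-a-day sweep.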
|
7/28/2017 9:46:55 PM
disk space (#1295)
The execution directory can grow relatively large, which could lead to problems if the executor is shut down and stale executions remain on the executor, occupying disk space.
This change deletes the execution directory on executor shutdown. The directory is created on executor initialization, so it does not need to be kept in order to reset the executor later.
Note that the execution directory will not be deleted until AFTER all flows are done if the executor is brought down via shutdown() - so it will not delete partially-written logs before they are uploaded to the database.
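The ordering described above (wait for flows, then delete) can be sketched in Java as follows. ExecutionDirCleanup, shutdownAndClean and the flowPool parameter are hypothetical names used for illustration; this is not the code from the change itself.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.stream.Stream;

public class ExecutionDirCleanup {
  // Deletes dir and everything below it, children before parents.
  static void deleteRecursively(Path dir) {
    if (!Files.exists(dir)) {
      return;
    }
    try (Stream<Path> paths = Files.walk(dir)) {
      paths.sorted(Comparator.reverseOrder()).forEach(p -> {
        try {
          Files.delete(p);
        } catch (IOException e) {
          throw new UncheckedIOException(e);
        }
      });
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Lets running flows finish first, then removes the execution directory.
  static void shutdownAndClean(ExecutorService flowPool, Path executionDir) {
    flowPool.shutdown(); // stop accepting new flows, let running ones finish
    try {
      while (!flowPool.awaitTermination(1, TimeUnit.SECONDS)) {
        // keep waiting until all flows are done
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return; // assumption: leave the directory in place if interrupted
    }
    deleteRecursively(executionDir);
  }
}
```

Deleting only after the pool has terminated is what keeps partially written logs on disk until the flows that own them have finished.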
|
7/28/2017 1:38:48 PM
execution cleanup (#1292)
Setting the gid (setgid) bit on a directory causes all items created within that directory to inherit the group of the directory. So all items users create in their /execution/<exec_id> directory will automatically be part of the azkaban group. This allows the azkaban cleanup thread to properly remove user-generated files/directories.
The gid bit is system-specific, so there is no Java standard library API for setting it. The solution I proposed spawns a subprocess that performs the chmod command. I don't think this is very clean, but I haven't been able to find a better way. Does anybody have any ideas?
Another option would be to build this executionDirectory as part of our build process, but that isn't how we do things right now (tested by running deploy on the holdem4 Jenkins job without any of the other steps and verifying that no /executions directory is created until system start).
Note that this change is not covered by tests, due to the difficulty of dealing simultaneously with filesystems and subprocesses within a testing environment. If disk space usage increases in the future, it is possible that this change has regressed. To confirm that this change is working in production on particularly large clusters (where problems have been seen), I'll be keeping an eye on it when it's released.
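The subprocess approach mentioned above might look roughly like the following Java sketch. The class and method names are hypothetical, and the exact chmod arguments used by the change are not given in this entry; "g+s" is assumed here because it is the usual symbolic mode for the setgid bit.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

public class SetgidSketch {
  // Builds the command line; "g+s" is an assumption, not taken from the PR.
  static List<String> chmodCommand(Path dir) {
    return Arrays.asList("chmod", "g+s", dir.toString());
  }

  // Spawns chmod as a subprocess and reports whether it exited with status 0.
  static boolean setGroupIdBit(Path dir) {
    try {
      Process process = new ProcessBuilder(chmodCommand(dir))
          .inheritIO() // surface chmod's own error messages
          .start();
      return process.waitFor() == 0;
    } catch (IOException e) {
      return false; // chmod missing or not executable on this system
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }
}
```

Returning a boolean rather than throwing lets the caller decide whether a failed chmod should block executor startup or only be logged.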
|
7/27/2017 10:46:17 PM
would reduce the memory overhead for sessions to some extent so that we can increase the size of the session cache further.
|
7/27/2017 8:26:53 PM
Add metrics for sending email success/failure.
* Guicify Emailer class and refactor some test cases.
|
7/27/2017 6:49:31 PM
in DB. (#1288)
|
7/25/2017 6:40:06 PM
to classes' annotations
This patch refactors the Guice usage, mainly moving all singleton
bindings to their respective classes via singleton annotations. The corresponding
tests are added as well. A bit more context is at #1285.
|
7/25/2017 12:31:14 AM
type for killing a job
The action is to kill a job and retry it based on the retry configuration of that job.
Previously, only killing the flow was allowed when an SLA was missed, even if the SLA was set at the job level.
The new action kicks in when a user sets an SLA rule on a job and enforces the kill action when the SLA is missed. There's no UI change.
Tested manually with the following flows:
jobA(retry num: 2)->jobB(retry num: 2)
SLA rule: if job A doesn't succeed in 1 min, kill the job
SLA rule: if job B doesn't succeed in 1 min, kill the job
jobA->jobB, jobA->jobC, jobB->jobD, B retry number is set to 2.
SLA rule: if job B doesn't succeed in 1 min, kill it.
|
7/24/2017 2:22:05 PM
(#1279)
|