WEBVTT

00:00.000 --> 00:16.000
So, this is my talk on ravanan, which is a new CWL implementation that uses Guix.

00:16.000 --> 00:19.000
CWL is the common workflow language.

00:19.000 --> 00:25.000
It is used for running scientific workflows on HPC systems for data analysis.

00:25.000 --> 00:29.000
I will say more on that later.

00:29.000 --> 00:39.000
So, my problem is that I am an academic scientist and I have mountains of data to analyze.

00:39.000 --> 00:43.000
In my case, this is usually genomic data.

00:43.000 --> 00:56.000
And when I say mountains, it is easily some fraction of a terabyte or maybe even a few terabytes.

00:56.000 --> 01:04.000
So, HPC, if you are not familiar, still uses batch schedulers.

01:04.000 --> 01:12.000
So, you have a bunch of compute nodes and you have some node on which you submit jobs.

01:12.000 --> 01:20.000
And there is a queue, and your job gets scheduled on the compute nodes at some point in time, when you get high enough in the priority.

01:20.000 --> 01:29.000
So, it is a very traditional, maybe even antiquated system of computation, but that is still the way it is done.

01:29.000 --> 01:37.000
And each of these jobs that you submit, each job is typically just a shell script, a batch script.

01:37.000 --> 01:42.000
And each of these jobs can take hours, maybe even days to compute.

01:42.000 --> 01:50.000
So, you really do not want to repeat those jobs if you can help it.

01:50.000 --> 01:59.000
So, some caching of these job steps is helpful to prevent re-computation.

01:59.000 --> 02:11.000
But the problem is that in scientific workflows, you might want to tweak parameters that are input to the commands that you are running.

02:12.000 --> 02:17.000
Or maybe you want to try a different version of the software tools that you are using.

02:17.000 --> 02:26.000
Or it could even be a more complicated problem where, well, your tool is the same, but some library has changed or maybe the compiler has changed.

02:26.000 --> 02:40.000
So, it is hard to do perfect caching, where you can be sure that your cache is not stale, that the results you are pulling out of the cache

02:40.000 --> 02:43.000
are actually the results that you want.

02:43.000 --> 02:58.000
So, people typically don't have any caching. It is very common for people to stuff all their parameters into the file names of their results.

02:58.000 --> 03:04.000
And, you know, things like final, final, n equal to one, and then usually there are many more parameters.

03:04.000 --> 03:12.000
So, you stuff like 10 parameters into a file name, and it is very hard to do that consistently even if you are the only person doing the data analysis.

03:12.000 --> 03:17.000
So, this is a problem.

03:17.000 --> 03:29.000
So, about the common workflow language, I use something called the common workflow language to describe the jobs that I want to submit to the HPC.

03:30.000 --> 03:39.000
So, the common workflow language is one of many workflow languages, but some of the special features of CWL are that it is standards-based.

03:39.000 --> 03:44.000
It is a YAML specification.

03:44.000 --> 03:51.000
So, it is very machine readable and it is kind of easy to write.

03:51.000 --> 03:55.000
At least it is easy to generate from a programming language.

03:55.000 --> 04:02.000
So, I can show you what a typical workflow looks like.

04:02.000 --> 04:08.000
So, this is a really simple workflow for spell checking.

04:08.000 --> 04:12.000
I mean, it is not really a scientific workflow, it is just a toy example I am showing you.

04:12.000 --> 04:19.000
So, the blue boxes you see on the top are the inputs and the blue box at the end is the output.

04:19.000 --> 04:26.000
And these are all the steps: each one of these yellow blocks in between is one step of the computation.

04:26.000 --> 04:30.000
So, one step corresponds to one job that runs on the HPC.

04:30.000 --> 04:37.000
And these steps can be connected in any graph like form.

04:37.000 --> 04:43.000
So, it does not have to be linear, it can have these parallel paths that come together at some point.

04:43.000 --> 04:52.000
It is a DAG.

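NOTE
A minimal Python sketch of how a workflow that forms a DAG could be put into a valid execution order.
The step names are hypothetical stand-ins for the spell-checking example, and this is not how ravanan itself is implemented.
```python
from graphlib import TopologicalSorter
# Hypothetical step names; each key maps a step to the steps it depends on.
workflow = {
    "split_words": set(),
    "find_misspellings": {"split_words"},
    "count_words": {"split_words"},
    "report": {"find_misspellings", "count_words"},
}
def execution_order(graph):
    """Return one valid order in which the steps could be submitted."""
    return list(TopologicalSorter(graph).static_order())
order = execution_order(workflow)
print(order)
```
The two middle steps have no dependency on each other, so a scheduler would be free to run them in parallel.
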
04:52.000 --> 05:00.000
Yeah, so here is a more realistic workflow. This is actually a workflow that I use at work.

05:00.000 --> 05:05.000
So, and I will be showing a demo of this later today.

05:05.000 --> 05:12.000
So, it is very similar to the spell check workflow: it has inputs, outputs and steps in between.

05:12.000 --> 05:17.000
Just the tools that are used in between are actual bioinformatics tools.

05:17.000 --> 05:21.000
So, I will come back to the slides.

05:21.000 --> 05:27.000
So, CWL is standards-based and it is a strongly typed language.

05:27.000 --> 05:33.000
Each one of the inputs and outputs that I showed you in the picture, there is a type attached to it.

05:33.000 --> 05:39.000
And each one of the steps accepts certain types of inputs and gives you certain types of outputs.

05:39.000 --> 05:49.000
So, you can type check all these, you can type check the entire workflow even before you execute it.

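NOTE
A rough Python sketch of type checking a workflow before it runs, in the spirit of what is described above.
The step names, port names and type names are assumptions made for illustration; real CWL types and real checkers are richer than this.
```python
# Hypothetical step signatures: name mapped to (input types, output types).
steps = {
    "align": ({"sequences": "File", "threads": "int"}, {"alignment": "File"}),
    "summarise": ({"alignment": "File"}, {"report": "File"}),
}
# Each link connects a producer's output to a consumer's input.
links = [("align", "alignment", "summarise", "alignment")]
def type_errors(steps, links):
    """Return a message for every link whose two ends disagree on type."""
    errors = []
    for src, out_name, dst, in_name in links:
        produced = steps[src][1][out_name]
        expected = steps[dst][0][in_name]
        if produced != expected:
            errors.append(f"{src}.{out_name} is {produced}, but {dst}.{in_name} expects {expected}")
    return errors
print(type_errors(steps, links))
```
Because the check only reads the declared types, it can reject a badly wired workflow before any job is submitted.
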
05:49.000 --> 05:56.000
So, that is CWL. So, what about ravanan, which is my CWL implementation?

05:56.000 --> 06:01.000
How does it actually run CWL and how does it use Guix?

06:01.000 --> 06:07.000
So, I mentioned earlier that the cache going stale is a problem.

06:07.000 --> 06:11.000
And one good way to solve that is to use Guix.

06:11.000 --> 06:28.000
So, if you are familiar with Guix, every binary in the store is attached to a long hash that encodes all the dependencies that were used in building that binary.

06:28.000 --> 06:35.000
So, essentially that hash gives you a unique identifier for that tool.

06:35.000 --> 06:45.000
And in Guix you can use G-expressions and a construct called program-file to generate a unique script with a hash.

06:45.000 --> 06:53.000
That is, a script that has the specific scientific commands that you want to run.

06:53.000 --> 07:05.000
So, once you have that hash, you can build a cache which is keyed by those hashes, and that cache is guaranteed to never go stale.

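NOTE
A minimal Python sketch of a hash-keyed cache, to illustrate the idea only.
It hashes plain script text with SHA-256, which is an assumption for the example: Guix store hashes also cover every build dependency, and ravanan's real store layout is not shown in the talk.
```python
import hashlib
import json
def cache_key(script_text, inputs):
    """Hash the exact script together with its inputs (sorted for stability)."""
    payload = json.dumps({"script": script_text, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
def run_cached(store, script_text, inputs, run):
    """Return the stored result for this key, computing it only on a miss."""
    key = cache_key(script_text, inputs)
    if key not in store:
        store[key] = run(inputs)
    return store[key]
store = {}
calls = []
def slow_step(inputs):
    """Stand-in for an expensive job; records each real execution."""
    calls.append(inputs)
    return inputs["threads"] * 2
first = run_cached(store, "tool-v1 --threads", {"threads": 4}, slow_step)
second = run_cached(store, "tool-v1 --threads", {"threads": 4}, slow_step)
third = run_cached(store, "tool-v2 --threads", {"threads": 4}, slow_step)
print(first, second, third, len(calls))
```
The second call is a cache hit, while changing the script text (here a hypothetical new tool version) produces a new key and a fresh run.
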
07:05.000 --> 07:10.000
So, let us get on to a demo and maybe things will become clearer.

07:10.000 --> 07:20.000
So, yeah, so I have this CWL file which is the workflow that I am going to execute.

07:20.000 --> 07:28.000
So, this is just the same workflow, but I am showing you the actual YAML code that describes the workflow.

07:28.000 --> 07:33.000
You have inputs, outputs and then various steps.

07:33.000 --> 07:40.000
So, all that is just CWL and I don't want to go into the details of CWL right now.

07:41.000 --> 07:44.000
Then let us look at the inputs file.

07:44.000 --> 07:53.000
So, I showed you in the picture that there are three inputs: chromosome, sequences and threads.

07:53.000 --> 08:00.000
So, all those inputs are defined here as a simple JSON file.

08:00.000 --> 08:06.000
So, sequence is sequence labels, chromosome and so on.

08:06.000 --> 08:10.000
So, now let me actually execute it.

08:10.000 --> 08:13.000
So, let us time it as well.

08:13.000 --> 08:21.000
So, ravanan, PNG nom.CWL, and specify the inputs file.

08:21.000 --> 08:24.000
And it also takes a store argument.

08:24.000 --> 08:26.000
The store in this case is not the Guix store.

08:26.000 --> 08:29.000
It is separate; it is basically the cache that ravanan uses.

08:29.000 --> 08:32.000
I just borrowed the Guix terminology here.

08:32.000 --> 08:39.000
So, let us run this.

08:43.000 --> 08:48.000
So, you can see that it is executing the workflow step by step.

08:48.000 --> 08:52.000
And yeah.

08:52.000 --> 08:59.000
So, it says the output is another JSON file, which is a JSON tree.

09:00.000 --> 09:07.000
Which contains the output file in this case, which is dot path file, which is a sequence thing.

09:07.000 --> 09:10.000
A bioinformatics sequence thing.

09:10.000 --> 09:13.000
So, it took 30 seconds to run this.

09:13.000 --> 09:17.000
I made the sequence really short.

09:17.000 --> 09:19.000
So, I could show you a demo.

09:19.000 --> 09:21.000
So, let us run it again.

09:21.000 --> 09:28.000
And now this time it should just pull straight out of the cache without having to recompute.

09:28.000 --> 09:33.000
Yeah, 8 seconds.

09:33.000 --> 09:40.000
So, you can see here that it says that this script has been previously run.

09:40.000 --> 09:43.000
Just retrieving the result from the store.

09:43.000 --> 09:45.000
And you have the same output.

09:45.000 --> 09:49.000
So, that is basically all there is to it.

09:49.000 --> 09:55.000
It is a matter of combining CWL with Guix to get a perfect cache.

09:55.000 --> 10:02.000
And now, each time you want to generate those results,

10:02.000 --> 10:06.000
you don't have to remember what parameters you put into the workflow.

10:06.000 --> 10:11.000
You can just, you know, ask the system to produce the results.

10:11.000 --> 10:14.000
And if it is already done, it will just give it to you.

10:14.000 --> 10:17.000
If it is not, it will be computed for you.

10:17.000 --> 10:20.000
So, you may be familiar with this problem.

10:20.000 --> 10:28.000
And using make, like make is all nice and good.

10:28.000 --> 10:33.000
But if you change the Makefile or if you change something else, you might have to do a make clean

10:33.000 --> 10:35.000
and then rebuild everything all over again.

10:35.000 --> 10:38.000
So, essentially defeating the purpose of make.

10:38.000 --> 10:43.000
So, this is just like that, but it is like a perfect make.

10:43.000 --> 10:45.000
So, that is it.

10:45.000 --> 10:47.000
Thank you very much.

10:48.000 --> 11:00.000
So, any questions?

11:00.000 --> 11:01.000
Yeah.

11:01.000 --> 11:08.000
How does it relate to the CWL language, like in the GNU project,

11:08.000 --> 11:11.000
which is an official page about CWL?

11:11.000 --> 11:14.000
So, CWL is a standard and that is separate.

11:14.000 --> 11:18.000
It has got nothing to do with Guix or GNU or anything else.

11:18.000 --> 11:20.000
It is a standard workflow language.

11:20.000 --> 11:22.000
I just implemented the specification.

11:22.000 --> 11:23.000
Yeah.

11:23.000 --> 11:24.000
And how does it compare to?

11:24.000 --> 11:30.000
Because there is, then, the Guix project for the CWL implementation.

11:30.000 --> 11:34.000
So, you mean other implementations of CWL, right?

11:34.000 --> 11:35.000
Like cwltool.

11:35.000 --> 11:36.000
Yeah.

11:36.000 --> 11:38.000
There are several CWL implementations.

11:38.000 --> 11:40.000
cwltool is the reference implementation.

11:40.000 --> 11:43.000
Then there is Toil and Arvados and so on.

11:43.000 --> 11:45.000
But none of them have a perfect cache.

11:45.000 --> 11:50.000
So, cwltool gives you some caching, but if you

11:50.000 --> 11:54.000
alter the workflow in some way, you essentially have to throw away the entire cache and

11:54.000 --> 11:55.000
recompute everything.

11:55.000 --> 11:56.000
Yeah.

11:56.000 --> 11:57.000
Questions?

11:57.000 --> 12:00.000
I suppose you are going to be integrating this with things

12:00.000 --> 12:03.000
which hold data from other sources.

12:04.000 --> 12:07.000
You have your guix-forge project.

12:07.000 --> 12:14.000
Are you sort of looking at ways of collecting data and all of the transmitting, right?

12:14.000 --> 12:19.000
Oh, geeks for this, it is a very, almost time to listen.

12:19.000 --> 12:21.000
15 minutes.

12:24.000 --> 12:25.000
Okay.

12:25.000 --> 12:26.000
I can show you ccwl.

12:26.000 --> 12:27.000
Yeah.

12:27.000 --> 12:29.000
But that is a different topic.

12:33.000 --> 12:42.000
Yeah, this was meant to be a very quick lightning talk.

12:48.000 --> 12:49.000
Yeah.

12:49.000 --> 12:55.000
So, I was saying that it is easy to generate YAML from any programming language.

12:55.000 --> 13:02.000
And so, the CWL workflows are supposed to be written as

13:02.000 --> 13:04.000
YAML files, and I really don't like writing YAML files.

13:04.000 --> 13:07.000
Even though some people do like it.

13:07.000 --> 13:09.000
I don't know.

13:09.000 --> 13:15.000
So, I have a different project, a different, so I built this domain

13:15.000 --> 13:20.000
specific language that generates the YAML files.

13:20.000 --> 13:24.000
So, this is built on Guile.

13:24.000 --> 13:28.000
So, it looks very Schemey, with S-expressions and all that.

13:28.000 --> 13:38.000
And I describe the inputs, the command that is running, which is this tool called

13:38.000 --> 13:39.000
Samtools.

13:39.000 --> 13:43.000
And then there is an output file.

13:43.000 --> 13:48.000
And in the other field, I specify the packages that are required for that step.

13:48.000 --> 13:53.000
So, packages in this case are the Guix packages.

13:53.000 --> 13:57.000
So, you have a bunch of steps or commands like that.

13:57.000 --> 14:02.000
And then they are all connected together in a workflow description, which is what you see

14:02.000 --> 14:03.000
then.

14:03.000 --> 14:06.000
So, I use both of them together.

14:06.000 --> 14:08.000
But it is really up to you.

14:08.000 --> 14:12.000
If you really like writing YAML, you can just write YAML and use

14:12.000 --> 14:15.000
ravanan and then the Guix part.

14:15.000 --> 14:21.000
So, there are actually people who like writing YAML, which is a bit strange, yeah.

14:21.000 --> 14:27.000
Yeah, it's starting to have to cover 15 minutes, so.

14:27.000 --> 14:33.000
If there are no questions, we could just end earlier.

14:33.000 --> 14:38.000
Yeah, you have a full CCWL talk online also, right?

14:38.000 --> 14:39.000
Yeah, yeah.

14:39.000 --> 14:41.000
I have a CCWL talk in the first time.

14:41.000 --> 14:42.000
Yeah.

14:42.000 --> 14:43.000
Yeah.

14:43.000 --> 14:46.000
I have just thought about the question.

14:46.000 --> 14:48.000
But the question about that was from Murray.

14:48.000 --> 14:50.000
Your link, why do you want to do this?

14:50.000 --> 14:51.000
Sorry.

14:51.000 --> 14:53.000
What kind of computer?

14:53.000 --> 14:54.000
Sorry.

14:54.000 --> 15:01.000
I was going to ask a bit of the big questions.

15:01.000 --> 15:03.000
But I was from Murray.

15:03.000 --> 15:06.000
Do you do bio-computing in your job?

15:07.000 --> 15:08.000
Sorry.

15:08.000 --> 15:10.000
Is it just examples for...

15:10.000 --> 15:11.000
Yeah.

15:11.000 --> 15:13.000
I do bioinformatics in my day job.

15:13.000 --> 15:14.000
Okay.

15:14.000 --> 15:15.000
So, do you know my data?

15:15.000 --> 15:16.000
Okay.

15:16.000 --> 15:17.000
Yeah.

15:17.000 --> 15:18.000
Thank you.

15:21.000 --> 15:22.000
Yes.

15:22.000 --> 15:23.000
It's too stretch, right?

15:23.000 --> 15:24.000
Yeah, yeah.

15:24.000 --> 15:31.000
I can't imagine running jobs on HPC without Guix or without a good cache.

15:31.000 --> 15:36.000
Because usually, by the time you get to the point of writing a paper, it's been months

15:36.000 --> 15:37.000
or maybe even years.

15:37.000 --> 15:40.000
And you don't really remember what you did a long time ago.

15:40.000 --> 15:44.000
And keeping notes doesn't really work that well.

15:44.000 --> 15:45.000
Yeah.

15:45.000 --> 15:58.000
I wanted to talk more about the controversial, not controversial.

15:58.000 --> 16:00.000
You said at the end, this is just like a better make.

16:00.000 --> 16:04.000
You mean, would you really get rid of make, if you could, to replace it with, like...

16:04.000 --> 16:08.000
Guix, or Arun's setup, is actually based on hashes.

16:08.000 --> 16:09.000
Yeah.

16:09.000 --> 16:13.000
So, you can't have these sorts of conflicts when running on a thousand nodes.

16:13.000 --> 16:16.000
And within one second into false.

16:16.000 --> 16:20.000
And people have dropped make for that reason, because you get all types of

16:20.000 --> 16:23.000
non-deterministic behavior.

16:23.000 --> 16:27.000
Yeah, I've seen nodes on HPC where, like,

16:28.000 --> 16:31.000
sometimes people forget to set up NTP daemons on them.

16:31.000 --> 16:35.000
And the time is like 10 minutes, 15 minutes off.

16:35.000 --> 16:37.000
And then strange things happen.

16:37.000 --> 16:38.000
Yeah.

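NOTE
A small Python sketch contrasting a timestamp-based rebuild rule with a hash-based one under clock skew, as discussed above.
The numbers and file contents are made up for illustration; make's real rule set is more involved than the single comparison shown here.
```python
import hashlib
def stale_by_mtime(source_mtime, target_mtime):
    """make-style rule: rebuild only if the source looks newer than the target."""
    return source_mtime > target_mtime
def stale_by_hash(source_bytes, recorded_digest):
    """Hash-style rule: rebuild whenever the content differs from what was recorded."""
    return hashlib.sha256(source_bytes).hexdigest() != recorded_digest
# Hypothetical scenario: the source was edited on a node whose clock runs 600 s behind.
old_source, new_source = b"threshold = 1\n", b"threshold = 2\n"
recorded = hashlib.sha256(old_source).hexdigest()
target_mtime = 1000.0
skewed_source_mtime = 1300.0 - 600.0
print(stale_by_mtime(skewed_source_mtime, target_mtime))
print(stale_by_hash(new_source, recorded))
```
With the skewed clock the timestamp rule misses the edit, while the hash rule still detects that the content changed.
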
16:43.000 --> 16:45.000
So, you have a question?

16:45.000 --> 16:46.000
No.

16:46.000 --> 16:48.000
You look like you have a question.

16:51.000 --> 16:53.000
Excellent. Thank you, everyone.

