WEBVTT

00:00.000 --> 00:28.000
So, this is my talk on ravanan, which is a new CWL implementation that uses Guix.

00:28.000 --> 00:31.000
CWL is the common workflow language.

00:31.000 --> 00:38.000
It is used in running scientific workflows on HPC systems for data analysis.

00:38.000 --> 00:41.000
I will say more on that later.

00:41.000 --> 00:51.000
So, my problem is that I am an academic scientist and I have mountains of data to analyze.

00:52.000 --> 01:09.000
In my case, this is usually genomic data and when I say mountains, it is easily some fraction of a terabyte or maybe even a few terabytes.

01:09.000 --> 01:17.000
So, HPC, if you are not familiar, still uses a batch scheduler.

01:17.000 --> 01:24.000
So, you have a bunch of compute nodes and you have some node on which you submit jobs.

01:24.000 --> 01:32.000
And there is a queue, and your job gets scheduled on the compute nodes at some point in time, when it gets high enough in priority.

01:32.000 --> 01:42.000
So, it is a very traditional maybe even antiquated system of computation, but that is still the way it is done.

01:42.000 --> 01:50.000
And each of these jobs that you submit, each job is typically just a shell script, a batch script.

01:50.000 --> 01:55.000
And each of these jobs can take hours, maybe even days to compute.

01:55.000 --> 02:03.000
So, you really do not want to repeat those jobs if you can help it.

02:03.000 --> 02:12.000
So, some caching of these job steps is helpful to prevent recomputation.

02:12.000 --> 02:24.000
But the problem is that in scientific workflows, you might want to tweak parameters that are input to the commands that you are running.

02:24.000 --> 02:29.000
Or maybe you want to try a different version of the software tools that you are using.

02:29.000 --> 02:38.000
Or it could even be a more complicated problem where your tool is the same, but some library has changed or maybe the compiler has changed.

02:38.000 --> 02:49.000
So, it is hard to do perfect caching where you can be sure that your cache is not stale.

02:49.000 --> 02:56.000
And that the results you are pulling out of the cache are actually the results that you want.

02:56.000 --> 03:04.000
So, people typically don't have any caching at all.

03:04.000 --> 03:11.000
It is very common for people to stuff in all their parameters into the file names of their results.

03:11.000 --> 03:17.000
And, you know, things like final, final, n equal to one, and then usually there are many more parameters.

03:17.000 --> 03:25.000
So, you stuff something like 10 parameters into a file name, and it is very hard to do that consistently, even if you are the only person doing the data analysis.

03:25.000 --> 03:30.000
So, this is a problem.

03:30.000 --> 03:42.000
So, about the common workflow language, I use something called the common workflow language to describe the jobs that I want to submit to the HPC.

03:42.000 --> 03:47.000
So, the common workflow language is one of many workflow languages.

03:47.000 --> 03:52.000
But some of the special features of CWL are that it is standards-based.

03:52.000 --> 03:57.000
It is a YAML specification.

03:57.000 --> 04:03.000
So, it is very machine-readable, and it is kind of easy to write too.

04:03.000 --> 04:08.000
At least it is easy to generate from a programming language.

04:08.000 --> 04:15.000
So, I can show you what a typical workflow looks like.

04:15.000 --> 04:21.000
So, this is a really simple workflow for spell checking.

04:21.000 --> 04:25.000
I mean it is not really a scientific workflow, it is just a toy example I am showing you.

04:25.000 --> 04:29.000
So, the blue boxes you see at the top are the inputs.

04:30.000 --> 04:33.000
And the blue box at the end is the output.

04:33.000 --> 04:39.000
And these are all the steps; each one of these yellow blocks in between is one step of the computation.

04:39.000 --> 04:43.000
So, one step corresponds to one job that runs on the HPC.

04:43.000 --> 04:50.000
And these steps can be connected in any graph-like form.

04:50.000 --> 04:56.000
So, it does not have to be linear, it can have these parallel paths that come together at some point.

04:56.000 --> 05:06.000
So, it is a DAG.
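
NOTE
Illustrative sketch, not shown in the talk: one way to see why a DAG of steps is enough to schedule a workflow.
The step names are hypothetical stand-ins for the spell-check example; each step maps to the steps it depends on.
Python's standard graphlib module is used here purely for illustration and is not what the speaker's tool uses.
```python
from graphlib import TopologicalSorter
# Hypothetical steps: each key maps to the set of steps that must finish first.
steps = {
    "split_words": set(),
    "find_dictionary": set(),
    "check_spelling": {"split_words", "find_dictionary"},
    "report": {"check_spelling"},
}
def execution_order(dependencies):
    """Return one valid order in which every step runs after its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())
order = execution_order(steps)
```
The two steps with no dependencies are the parallel paths; they may appear in either order, and could be submitted as jobs at the same time.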

05:06.000 --> 05:10.000
So, here is a more realistic workflow.

05:10.000 --> 05:14.000
This is actually a workflow that I use at work.

05:14.000 --> 05:18.000
So, and I will be showing a demo of this later today.

05:18.000 --> 05:22.000
So, it is very similar to the spell check workflow.

05:22.000 --> 05:25.000
It has inputs, outputs and steps in between.

05:25.000 --> 05:30.000
Just the tools that are used in between are actual bioinformatics tools.

05:30.000 --> 05:34.000
So, back to the slides.

05:34.000 --> 05:40.000
So, to repeat, CWL is standards-based and it is a strongly typed language.

05:40.000 --> 05:44.000
Each one of the inputs and outputs that I showed you in the picture

05:44.000 --> 05:46.000
has a type attached to it.

05:46.000 --> 05:52.000
And each one of the steps accepts certain types of inputs and gives you certain types of outputs.

05:52.000 --> 05:56.000
So, you can type check all these.

05:56.000 --> 06:02.000
You can type check the entire workflow even before you execute it.
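
NOTE
Illustrative sketch, not shown in the talk: checking that connected steps agree on types before anything runs.
The function name, the port names and the plain string types are assumptions made for this example only.
Real CWL type checking is richer (optional types, arrays, records); this only shows the basic idea.
```python
def check_links(output_types, input_types, links):
    """Return one error message per link whose source and target types differ."""
    errors = []
    for source, target in links:
        produced = output_types[source]
        expected = input_types[target]
        if produced != expected:
            errors.append(f"{source} produces {produced} but {target} expects {expected}")
    return errors
```
An empty list means every link is consistent, so a mismatch is reported up front and not after hours of queueing.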

06:02.000 --> 06:04.000
So, that is CWL.

06:04.000 --> 06:09.000
So, how does ravanan, which is my CWL implementation, work?

06:09.000 --> 06:11.000
How does it actually run CWL?

06:11.000 --> 06:14.000
And how does it use Guix?

06:14.000 --> 06:19.000
So, I mentioned earlier that the cache going stale is a problem.

06:19.000 --> 06:23.000
And one good way to solve that is to use Guix.

06:23.000 --> 06:26.000
So, Guix lets you...

06:26.000 --> 06:36.000
So, if you are familiar with Guix, every binary in the store is attached to a long hash

06:36.000 --> 06:40.000
that encodes all the dependencies that were used in building that binary.

06:40.000 --> 06:47.000
So, essentially that hash gives you a unique identifier for that tool.

06:48.000 --> 06:54.000
And in Guix you can use G-expressions and a construct called program-file

06:54.000 --> 06:59.000
To generate a unique script with the hash.

06:59.000 --> 07:06.000
That script has the specific, you know, scientific commands that you want to run.

07:06.000 --> 07:14.000
So, once you have that hash, you can build a cache which has those hashes.

07:14.000 --> 07:18.000
And that cache is guaranteed to never go stale.
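
NOTE
Illustrative sketch, not shown in the talk and not the tool's actual code: a cache keyed by a hash of the step script and its inputs.
The function names and the dictionary used as a store are assumptions made for this example.
It assumes the script text embeds the full store path of every tool, so any change to a tool or its dependencies changes the key.
```python
import hashlib
import json
def step_key(script_text, inputs):
    """Hash the exact script text together with its inputs (key order is normalized)."""
    payload = json.dumps({"script": script_text, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
def run_cached(store, script_text, inputs, run):
    """Return the stored result for this key, calling run only on a cache miss."""
    key = step_key(script_text, inputs)
    if key not in store:
        store[key] = run(script_text, inputs)
    return store[key]
```
Because the key is derived from everything that can affect the result, an old entry is never returned for a changed tool or parameter; it simply stops being looked up.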

07:18.000 --> 07:23.000
So, let us get on to a demo and maybe things should become clear.

07:23.000 --> 07:27.000
So, yeah.

07:27.000 --> 07:30.000
So, I have this CWL file.

07:30.000 --> 07:33.000
It is the workflow that I am going to execute.

07:33.000 --> 07:37.000
So, this is just this workflow.

07:37.000 --> 07:41.000
But, I am showing you the actual YAML code that describes the workflow.

07:41.000 --> 07:45.000
You have inputs, outputs and then various steps.

07:45.000 --> 07:53.000
So, all that is just CWL and I don't want to go into details of CWL right now.

07:53.000 --> 07:56.000
Then let us look at this; this is an inputs file.

07:56.000 --> 08:05.000
So, I showed you in the picture that there are three inputs: chromosome, sequences and threads.

08:05.000 --> 08:12.000
So, all those inputs are defined here as a simple JSON file.

08:12.000 --> 08:19.000
So, sequences, sequence labels, chromosome and so on.

08:19.000 --> 08:23.000
So, now let me actually execute it.

08:23.000 --> 08:26.000
So, let us time it as well.

08:26.000 --> 08:34.000
So, ravanan, pangenome.cwl, and specify the inputs file.

08:34.000 --> 08:37.000
And it also takes a store argument.

08:37.000 --> 08:39.000
The store in this case is not the Guix store.

08:39.000 --> 08:40.000
It is separate.

08:40.000 --> 08:42.000
It is basically the cache that ravanan uses.

08:42.000 --> 08:46.000
I just borrowed the Guix terminology here.

08:46.000 --> 08:56.000
So, let us run this.

08:56.000 --> 09:01.000
So, you can see that it is executing the workflow step by step.

09:01.000 --> 09:16.000
So, the output is another JSON file, a JSON tree, which contains the output file,

09:16.000 --> 09:21.000
which in this case is a .paf file, which is a sequence thing.

09:21.000 --> 09:23.000
A bioinformatics sequence thing.

09:23.000 --> 09:26.000
So, it took 30 seconds to run this.

09:26.000 --> 09:30.000
I made the sequence really short.

09:30.000 --> 09:32.000
So, I could show you a demo.

09:32.000 --> 09:34.000
So, let us run it again.

09:34.000 --> 09:44.000
And now this time it should just pull straight out of the cache without having to recompute.

09:44.000 --> 09:46.000
Yeah, 8 seconds.

09:46.000 --> 09:53.000
So, you can see here that it says that this script has been previously run.

09:53.000 --> 09:56.000
Just retrieving the result from the store.

09:56.000 --> 09:59.000
And you have the same output.

09:59.000 --> 10:02.000
So, that is basically all there is to it.

10:02.000 --> 10:08.000
It is a matter of combining CWL with Guix to get a perfect cache.

10:08.000 --> 10:15.000
And now each time you want to generate those results.

10:15.000 --> 10:19.000
You don't have to remember what parameters you put into the workflow.

10:20.000 --> 10:25.000
You can just ask the system to produce the results.

10:25.000 --> 10:28.000
And if it is already done, it will just give it to you.

10:28.000 --> 10:30.000
If it is not, it will be computed for you.

10:30.000 --> 10:36.000
So, you may be familiar with this problem from using make.

10:36.000 --> 10:41.000
Like make is all nice and good.

10:41.000 --> 10:47.000
But if you change the Makefile or if you change something else, you might have to do a make clean

10:47.000 --> 10:49.000
and rebuild everything all over again.

10:49.000 --> 10:51.000
So, essentially defeating the purpose of make.

10:51.000 --> 10:56.000
So, this is just like that, but it is like a perfect make.

10:56.000 --> 10:58.000
So, that is it.

10:58.000 --> 11:00.000
Thank you very much.

11:01.000 --> 11:14.000
So, any questions?

11:14.000 --> 11:24.000
How does it relate to the CWL language? Like, in the GNU project, is there an official page about CWL?

11:24.000 --> 11:27.000
So, CWL is a standard and that is separate.

11:27.000 --> 11:31.000
It has got nothing to do with Guix or GNU or anything else.

11:31.000 --> 11:33.000
It is a standard workflow language.

11:33.000 --> 11:35.000
I just implemented the specification.

11:35.000 --> 11:37.000
And how does it compare to?

11:37.000 --> 11:43.000
Because there is a specific Guix project for the CWL implementation.

11:43.000 --> 11:47.000
So, I mean other implementations of CWL.

11:47.000 --> 11:48.000
Like cwltool.

11:48.000 --> 11:51.000
Yeah, there are several CWL implementations.

11:51.000 --> 11:53.000
cwltool is the reference implementation.

11:53.000 --> 11:56.000
Then there is Toil and Arvados and so on.

11:56.000 --> 11:58.000
But none of them have a perfect cache.

11:58.000 --> 12:05.000
So, cwltool gives you some caching, but if you alter the workflow in some way,

12:05.000 --> 12:09.000
you essentially have to throw away the entire cache and recompute everything.

12:09.000 --> 12:10.000
Yes, sir?

12:10.000 --> 12:16.000
I suppose you are going to be integrating this with things which hold data from other sources.

12:17.000 --> 12:20.000
You have your Guix or projects.

12:20.000 --> 12:26.000
Are you sort of looking at ways of collecting data or transmitting it?

12:26.000 --> 12:27.000
Yes, sir.

12:27.000 --> 12:29.000
Guix for this, it is a very...

12:29.000 --> 12:31.000
Almost time did you say?

12:31.000 --> 12:33.000
Yes, 15 minutes.

12:33.000 --> 12:34.000
15 minutes.

12:37.000 --> 12:39.000
Okay, I can show you CWL.

12:39.000 --> 12:42.000
Yeah, but that is a different topic.

12:46.000 --> 13:01.000
Yeah, this was meant to be a very quick lightning talk.

13:01.000 --> 13:12.000
Yeah, so I was saying that it is easy to generate YAML from any programming language.

13:12.000 --> 13:18.000
So, CWL workflows are supposed to be written as YAML files and I really don't like writing YAML files,

13:18.000 --> 13:22.000
even though some people do like it.

13:22.000 --> 13:27.000
So, I have a different project, a different...

13:27.000 --> 13:34.000
So, I built this domain-specific language that generates the YAML files.

13:34.000 --> 13:37.000
So, this is built on Guile.

13:37.000 --> 13:42.000
So, it looks very Schemey, with S-expressions and all that.

13:42.000 --> 13:52.000
And I describe the inputs, the command that is running, which is this tool called samtools.

13:52.000 --> 13:56.000
And then there is an output file.

13:56.000 --> 14:01.000
And in the other field, I specify the packages that are required for that step.

14:01.000 --> 14:06.000
So, packages in this case are the Guix packages.

14:06.000 --> 14:10.000
So, you have a bunch of steps or commands like that.

14:10.000 --> 14:16.000
And then they are all connected together in a workflow description, which is what you see then.

14:16.000 --> 14:21.000
So, I use both of them together, but it is really up to you.

14:21.000 --> 14:27.000
If you really like writing YAML, you can just write YAML and use only the ravanan and Guix part.

14:27.000 --> 14:33.000
So, there are actually people who like writing YAML, which is a bit strange, yeah.

14:37.000 --> 14:40.000
Yeah, it's starting to have to cover 15 minutes.

14:44.000 --> 14:47.000
Yeah, if there are no questions we could just end early.

14:47.000 --> 14:51.000
Yeah, you have a full CCWL talk online also, right?

14:51.000 --> 14:52.000
Yeah, yeah.

14:52.000 --> 14:54.000
Yeah, they have a CCWL talk in first time.

14:54.000 --> 14:55.000
Yeah.

14:55.000 --> 14:56.000
Yeah.

14:56.000 --> 15:01.000
Just a little bit of a question, but that wasn't really your blinker.

15:02.000 --> 15:06.000
Sorry, what kind of computer?

15:10.000 --> 15:15.000
Sorry, I was going to ask a bit of an off-topic question, but I was wondering,

15:15.000 --> 15:19.000
do you do bio-computing in your job?

15:19.000 --> 15:23.000
So, are you easy to adjust the examples for...

15:23.000 --> 15:26.000
Yeah, I do bioinformatics in my day job.

15:26.000 --> 15:27.000
Okay.

15:27.000 --> 15:28.000
So, genomic data.

15:28.000 --> 15:29.000
Okay.

15:29.000 --> 15:30.000
Yeah.

15:30.000 --> 15:33.000
Thank you.

15:33.000 --> 15:34.000
Yes.

15:34.000 --> 15:36.000
So, um, has this H2 stretch, right?

15:36.000 --> 15:37.000
Yeah, yeah.

15:37.000 --> 15:42.000
I mean, I can't imagine running jobs on HPC without Guix,

15:42.000 --> 15:45.000
or without a good cache, because usually,

15:45.000 --> 15:48.000
by the time you get to the point of writing a paper,

15:48.000 --> 15:50.000
it's been months or maybe even years,

15:50.000 --> 15:53.000
and you don't really remember what you did a long time ago.

15:53.000 --> 15:57.000
And keeping notes doesn't really work that well.

15:57.000 --> 16:00.000
So, yeah.

16:06.000 --> 16:11.000
I wanted to talk more about the controversial, not controversial.

16:11.000 --> 16:14.000
You said at the end, this is just like a better make.

16:14.000 --> 16:15.000
Do you mean it?

16:15.000 --> 16:18.000
Would you really get rid of make if you could to replace it with, like,

16:18.000 --> 16:20.000
Guix or ravanan?

