Adding a new tool
=================

1. Find or create a docker image
--------------------------------

Search for the tool and the ways it is distributed. Some tools already provide a docker image that can be used to run it. Sometimes, third parties will provide a docker image, but in practice these are not consistently updated, so be sure to check the version of the tool. Other tools will be available in a "generic" image containing multiple tools. If no suitable image can be found, create one yourself that wraps the tool, taking into account its dependencies and licenses.

**Example Dockerfile: MS Amanda**

For example, the following Dockerfile packages MS Amanda:

.. code-block:: docker

  FROM ubuntu:latest

  # Install dependencies
  RUN apt-get update && apt-get install -y \
      libc6-dev \
      libgcc1 \
      libgssapi-krb5-2 \
      libicu-dev \
      libssl-dev \
      libstdc++6 \
      zlib1g \
      && rm -rf /var/lib/apt/lists/*

  # Set working directory
  WORKDIR /msamanda

  # Copy MS Amanda binaries (e.g., MSAmanda, dependent files)
  COPY ./bin /msamanda

  # Make MS Amanda executable
  RUN chmod +x /msamanda/MSAmanda

  # Optional entrypoint
  ENTRYPOINT ["/msamanda/MSAmanda"]

This example assumes the MS Amanda binary and its dependencies are placed in a local ``bin/`` directory. See
https://github.com/Workflomics/tools-and-domains/tree/main/cwl-tools/ms_amanda/docker


2. Run the tool inside the container
------------------------------------

Most tools and containers have to be called in a specific way, so here are some tips for getting the tool running.

To check whether a tool works inside a Docker container, run the container interactively. For instance:

.. code-block:: bash

  docker run -it --rm \
    --entrypoint /bin/bash \
    --mount=type=bind,source=/repos/containers/cwl-tools/Sage/test/data,target=/data/ \
    sage:latest 

This starts a bash shell inside the container, where you can try to run the tool. The command mounts a local directory at ``/data/`` inside the container, so put the required input files and configuration there. Note that CWL runners mount the input files explicitly, but mounting the whole directory is useful for debugging.

Make sure the tool executes correctly and produces the expected output. Sometimes, additional config files will be required or arguments need to be passed in a specific way. If things are not working, consider the following:

- Check the version of the tool (usually with ``tool --version``).
- Check the help of the tool (usually with ``tool --help``).

It will be helpful to put the above command in a script, so you only need to figure out the correct command once. When the tool finally produces the expected output at the desired location, you are ready to go to the next step.


3. Create a semantically annotated CWL file
-------------------------------------------

.. important::

   The initial template for the CWL file can be generated from existing bio.tools annotations using the `APE` command line interface. See the `APE pull-a-tool <https://ape-framework.readthedocs.io/en/v2.4/docs/developers/cli.html#>`_ documentation for more information. The generated CWL file annotates the expected inputs and outputs and should be used as a starting point and modified to fit the specific tool version and requirements.

A Common Workflow Language (CWL) file formally describes how to run a computational tool. At minimum, a CWL file must clearly specify:

- **baseCommand**: The executable or command-line invocation.
- **inputs**: Files or parameters needed by the tool, including type and format.
- **outputs**: Result files produced by the tool, including type and retrieval method.

Additionally, CWL often specifies:

- **DockerRequirement**: A Docker image (``dockerPull``) containing the tool and its dependencies, and a directory for output (``dockerOutputDirectory``).
- **ShellCommandRequirement**: Runs the command line through a shell, so that arguments marked ``shellQuote: false`` can use shell syntax such as ``&&`` or pipes.

In our approach, we extend basic CWL with semantic annotations to facilitate automated workflow composition. We use the EDAM ontology to specify the computational purpose (``intent``), data types, and data formats clearly and consistently. Specifically, we add:

- **intent**: EDAM ``operation`` terms (e.g., peptide identification) that describe the tool's computational function.
- **Data types**: EDAM ``data`` terms on each input and output, written as a key-value pair in which the key is the general root ``edam:data_0006`` and the value is the specific data type.
- **Data formats**: EDAM ``format`` terms that identify the file format of each input and output.

To keep annotations concise, we declare an ``edam`` namespace prefix under ``$namespaces``.

This file can be generated automatically from the bio.tools annotations using the `APE pull-a-tool <https://ape-framework.readthedocs.io/en/v2.4/docs/developers/cli.html#>`_ command line interface (e.g., ``java -jar APE-2.5.2-executable.jar pull-a-tool Sage-proteomics``). The generated file contains the basic structure and annotations, which can then be modified to fit the specific tool version and requirements.

Here is a complete annotated example for the Sage tool, which performs peptide identification and retention time prediction:

.. code-block:: yaml

  cwlVersion: v1.2
  label: Sage-proteomics
  class: CommandLineTool
  baseCommand: ["/bin/bash", "-c"]
  arguments:
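    # The folded scalar (>) already joins these lines with spaces,
    # so no trailing backslashes are needed inside the quoted command.
    - valueFrom: >
        "sage -o /data/output -f $(inputs.Sage_in_2.path)
        $(inputs.Configuration.path) $(inputs.Sage_in_1.path) &&
        /data/sage_TSV_to_mzIdentML.sh /data/output/results.sage.tsv"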
      shellQuote: false
  requirements:
    ShellCommandRequirement: {}
    DockerRequirement:
      dockerPull: workflomics/sage:latest
      dockerOutputDirectory: /data
    InitialWorkDirRequirement:
      listing:
        - class: File
          location: sage_TSV_to_mzIdentML.sh
          basename: sage_TSV_to_mzIdentML.sh

  $namespaces:
    edam: http://edamontology.org/

  intent:
    - http://edamontology.org/operation_3631  # Peptide identification
    - http://edamontology.org/operation_3633  # Retention time prediction
    - http://edamontology.org/operation_2428  # Validation

  inputs:
    Sage_in_1:
      type: File
      format: edam:format_3244  # mzML
      edam:data_0006: edam:data_0943  # Mass spectrum
    Sage_in_2:
      type: File
      format: edam:format_1929  # FASTA
      edam:data_0006: edam:data_2976  # Protein sequence

    Configuration:
      type: File
      format: edam:format_3464  # JSON
      default:
        class: File
        format: edam:format_3464  # JSON
        location: https://raw.githubusercontent.com/Workflomics/tools-and-domains/main/cwl-tools/Sage-proteomics/config.json

  outputs:
    Sage_out_1:
      type: File
      format: edam:format_3247  # mzIdentML
      edam:data_0006: edam:data_0945  # Peptide identification
      outputBinding:
        glob: /data/output/results.sage.mzid


The CWL file describes a single workflow step, so check it on its own before using it in a workflow. Start by validating it with the ``cwltool`` command line tool, which checks the document without executing the tool. For instance:


.. code-block:: bash

  cwltool --validate path/to/cwlfile.cwl


4. Set up automatic testing for the tool (optional, recommended)
----------------------------------------------------------------

After the tool has been added and the CWL file created, we recommend adding an automated test. To enable continuous integration (CI) testing via GitHub Actions:

- Create a folder named ``test`` within your tool's directory.
- Inside this ``test`` folder, add two files:

  1. ``input.yml``: a YAML file specifying inputs for the CWL file.
  2. ``run-cwl.sh``: a bash script that executes the CWL tool with the provided inputs.

Example content of ``run-cwl.sh``:

.. code-block:: bash

  #!/bin/bash
  cwltool --outdir output ../your_tool.cwl ./input.yml

Replace ``your_tool.cwl`` with the name of your actual CWL file.
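
As an illustration, an ``input.yml`` for the Sage CWL file from step 3 could look like the sketch below. The keys ``Sage_in_1`` and ``Sage_in_2`` match the input names in that CWL file; the file paths are hypothetical placeholders for your own test data, and relative paths are expected to resolve against the location of ``input.yml``. ``Configuration`` is omitted because the CWL file declares a default for it. The ``format`` fields are written as full EDAM IRIs, on the assumption that the runner checks them against the formats declared in the CWL file.

.. code-block:: yaml

  Sage_in_1:
    class: File
    path: data/sample.mzML          # hypothetical test spectrum file
    format: http://edamontology.org/format_3244  # mzML
  Sage_in_2:
    class: File
    path: data/proteome.fasta       # hypothetical protein database
    format: http://edamontology.org/format_1929  # FASTA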

Once these files are in place, opening a pull request (PR) will trigger the GitHub Actions CI pipeline to run the provided test automatically, verifying the tool's functionality.


Adding a library as a tool
==========================

Sometimes a tool is not a standalone executable but a library for a programming language such as R, Python, or Java. In this case, wrap the library in a script that calls it with the correct arguments and produces the expected output. Run the script in a Docker container that provides the required library, environment, and dependencies. The CWL file then calls the script in the same way as a standalone executable.


Creating an R-based tool
-------------------------

1. Create the executable R script
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Since many R packages only provide function calls, write a simple R script (e.g., ``run_mytool.R``)
that accepts command-line arguments (via ``commandArgs(trailingOnly = TRUE)``) and then calls the package's functions.
Make this script executable in the Dockerfile and optionally specify it under ``ENTRYPOINT``.

2. Pick a base image
~~~~~~~~~~~~~~~~~~~~

Before building your own image, check whether a container for your tool already exists in
BioContainers or on Docker Hub. If there is none, write a Dockerfile.

A common choice of base image is the rocker family (e.g., ``rocker/r-base:4.2.0``), which provides a
functional R environment. In your Dockerfile, use ``apt-get install`` for system libraries
(e.g., ``libxml2-dev``) and ``R -e "install.packages(...)"`` or ``BiocManager::install(...)``
for R packages.

3. Test the tool
~~~~~~~~~~~~~~~~

Launch the container in interactive mode with ``docker run -it ...`` to verify that the R script
runs correctly and that all libraries are installed.

4. Write the CWL file
~~~~~~~~~~~~~~~~~~~~~

In ``baseCommand``, refer to ``["Rscript", "/path/to/run_script.R"]``. Define your inputs
and outputs according to the script's parameters.
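
As a minimal sketch, a CWL file for an R-based tool could look like the following. The label, image name, input and output names, and the two positional arguments (an input file and an output file name) are all hypothetical; adapt them to how your script reads ``commandArgs``. The EDAM annotations described in step 3 of the previous section are left out for brevity and would be added in the same way.

.. code-block:: yaml

  cwlVersion: v1.2
  class: CommandLineTool
  label: my-r-tool                          # hypothetical label
  baseCommand: ["Rscript", "/path/to/run_script.R"]
  requirements:
    DockerRequirement:
      dockerPull: example/my-r-tool:latest  # hypothetical image name

  inputs:
    input_file:
      type: File
      inputBinding:
        position: 1                         # first argument passed to the script
    output_name:
      type: string
      default: result.tsv                   # hypothetical output file name
      inputBinding:
        position: 2                         # second argument passed to the script

  outputs:
    result:
      type: File
      outputBinding:
        glob: $(inputs.output_name)         # collect the file the script wrote

With this layout the runner would invoke ``Rscript /path/to/run_script.R <input_file> <output_name>`` inside the container and collect the output file from the working directory.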