High-throughput technology has contributed to large-scale studies on the characterization of populations of biological entities [1]. A variety of "-omics" disciplines, such as genomics [2], transcriptomics [3], proteomics [4] and metabolomics [5, 6], have begun to emerge, with their own sets of instruments, techniques, reagents and software. The characterization of the "-omes" produces huge amounts of data that would be impossible to process without Information Technology. The work of life scientists is also rapidly changing. Now a researcher deals not only with laboratory equipment and in vitro experiments but also with software and web resources, i.e. in silico experiments. Scientific protocols include a very broad spectrum of activities (whether manual or automated) to be executed at the work bench and/or on the computer. Computers play a central role in data production, collection, storage, hypothesis formation and experimentation [7]. Several sectors of science are becoming largely automated [8] and this aspect has been highlighted by the emergence of "e-Science" [9]. However, to reap the benefits of computers and consequently of automation, it is essential that scientists change the way in which scientific knowledge is described, reported and finally stored. In fact, two of the problems in contemporary life science research are the interpretation and the reproducibility of published experimental results. Hence there is an urgent need for a formal representation of scientific knowledge, including procedures (e.g., laboratory protocols, bioinformatic workflows).
Laboratory protocols and experimental methodologies are indeed an integral part of research in life sciences. The way in which protocols are described is decisive in permitting the reproducibility and the successful replication of experiments. Normally, the detailed notes about the kind of experimental procedures and their order, the type of materials and the variety of methods used by a researcher are available only inside their research group or department. The information is then disseminated through the research community by scientific publications and as a consequence it becomes available for the use of scientists who are new to that topic. Every individual study rests on ad-hoc laboratory protocols, which are usually included in a "Materials and Methods" section and described only in natural language. This way of describing laboratory processes has many limitations for their repeatability, distribution and, more importantly, automation. Natural-language descriptions can lead to ambiguous statements and to widely differing interpretations. Textual representation is the best choice for readability, but it does not promote the re-use of parts of the protocol description, does not give a global, structured view of the whole process, and does not highlight the numerous resources necessary for the execution of the experiment.
A researcher can spend weeks or even months learning, setting up and applying new experimental techniques or protocols. Thus, a significant amount of time in the laboratory is spent learning techniques and procedures mainly published by other research groups. This is a never-ending process for experimental life scientists, since methodologies and their respective protocols are evolving at a dramatic pace. Moreover, laboratory automation is becoming increasingly crucial in many fields of experimental research. In fact, many wet-lab activities are becoming dependent on laboratory robots [10]. Bioinformatics encompasses automation in all the aspects related to biological data, including data collection, management and analysis. Two levels of formalization are required: one for the entities and operations deployed in protocols and another for the protocols themselves, which can combine manually executed and automated procedures.
An ontology is one strategy for the structured, formal representation of a chosen knowledge domain; it helps to remove ambiguity and redundancy, detect errors and allow automated reasoning. Ontologies describe the entities of the specific domain but do not specify how these entities should be used and combined.
Workflows can do this job. A workflow is a representation of a sequence of operations, declared as the work of a person, a group of persons, or machines. Workflows permit the description and the orchestration of complex processes in a visual form, capturing human-to-machine interactions within those processes. Several disciplines adopt workflow systems for the automation of data processing through a series of processing stages.
In this paper we propose a method for the formal representation of biological laboratory protocols that combines the unambiguous semantics of ontologies with the expressive power of workflows. Based on this approach, we have developed COW (Combining Ontology and Workflow), an add-on for the workflow editor JPEd [11] that integrates ontologies and workflows for the design of laboratory protocols. The software allows designers of protocol-workflows to select concepts from a domain-specific ontology and to include them in their workflows.
Laboratory protocols
Several on-line resources are available for retrieving information about life-science protocols and experiments. Since 1997, the Science Advisory Board (SAB) [12] has been working to improve communications between biomedical scientists and suppliers of laboratory products and services. SAB also maintains an extensive database of protocols divided by technique.
Protocol-Online [13] appeared in 1999 on the web as a database resource for research protocols in a variety of life science fields such as cell biology, molecular biology, developmental biology, and immunology.
In 2004, the Nature Publishing Group (NPG) launched Nature Methods [14], a monthly research journal on novel methods and significant improvements to laboratory techniques in the life sciences and related areas of chemistry. In addition, Nature Methods includes a Protocols section describing established methods written using 'bench terms'.
In 2006, JoVE [15] started to publish on-line video-protocols. The user is not required to read through a written protocol but can simply watch a video. Each video article includes step-by-step instructions for an experiment, a demonstration of equipment and reagents, and a brief discussion, with experts describing possible technical problems and modifications [16].
In the same year, Nature Protocols [17] became available as a cutting-edge on-line journal for biological and biomedical protocols. Protocols, written in natural language, are organized into logical categories in order to be easily accessible to researchers. They are presented in a 'recipe' style providing step-by-step descriptions of procedures that users can take to the lab bench and immediately apply to their own research.
As an example of a protocol for in silico experiments, Huang et al. [18] describe how to use the DAVID bioinformatic resources for the analysis of large gene lists derived from high-throughput genomic experiments, including how DAVID modules help users to extract biological meaning from a given gene list and how individual modules should be used either independently or jointly. Presented in this way, the procedure is easier for the reader to follow when reproducing the study.
The approach used for describing a computational procedure is also adopted for laboratory protocols. For instance, the protocol suggested by Fiegler [19] is organized into several sections; first, a list of materials used in the experiment including equipment, materials and their set up is provided. The second section is a step-by-step description of the methodology used. Critical steps that must be performed in a very precise manner and all toxic or harmful chemicals are highlighted. These warnings are tagged by the headings "Critical step" and "Caution".
Unlike the articles in the previously cited journals, in Nature Protocols the author is also asked to report the timing and possible troubleshooting, in order to give an idea of the duration of the procedure and of how to resolve the most likely problems. Writing protocols using the same pre-defined template helps the reader to understand the procedure, as well as the critical steps and implementation of the technique reported in the published study.
In laboratory protocols there are numerous examples of ambiguous sentences. In fact, statements that can be interpreted in different ways introduce uncertainty as to how the procedure should be performed. For example, in the instruction "Remove the supernatant and dry the precipitated DNA briefly before washing with 100 μl of 70% ethanol" the term "briefly" is ambiguous, since it may indicate different lengths of time. It could mean 30 seconds, 5 minutes, 10 minutes or a longer time. Similarly, the term "gentle" in the instruction "Transfer slides into a solution of 0.1% sodium dodecyl sulphate and incubate for 5 min with gentle shaking." can be arbitrarily interpreted. This problem could be overcome by providing a single value or a range of admissible values, depending on the activity performed, which reduces the ambiguity in the meaning of the term.
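As a minimal sketch of this idea, a vague term can be replaced by an explicit quantity with admissible bounds. The class name `Quantity` and the 1 to 5 minute reading of "briefly" are illustrative assumptions, not values taken from any cited protocol; only the 5 min incubation comes from the quoted instruction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quantity:
    """A protocol parameter expressed as a closed range of admissible values."""
    minimum: float
    maximum: float
    unit: str

    def __post_init__(self):
        # Reject ranges that cannot contain any value.
        if self.minimum > self.maximum:
            raise ValueError("minimum must not exceed maximum")

    def admits(self, value: float) -> bool:
        """Return True if `value` (in `unit`) lies within the admissible range."""
        return self.minimum <= value <= self.maximum


# Hypothetical reading of "dry briefly": the bounds are an assumption.
DRY_BRIEFLY = Quantity(minimum=1.0, maximum=5.0, unit="min")

# A single admissible value is a range whose bounds coincide
# ("incubate for 5 min" in the quoted instruction).
INCUBATE = Quantity(minimum=5.0, maximum=5.0, unit="min")
```

With such a representation, a person or a machine can check a planned duration with `DRY_BRIEFLY.admits(3.0)` instead of guessing what "briefly" means.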
Finally, the writing style of Nature Protocols is not intended to facilitate the automation of procedures. A computer cannot read such a protocol, interpret it and then replicate the original experiment.
Ontologies
The need to unambiguously classify the huge amount of data available, as well as to precisely define their semantic relationships, has increased the need for formal knowledge representation. In the 1980s, ontologies entered the computer science field as a way to provide a simplified and well-defined description of a specific domain or an area of interest. An ontology defines "a set of representational primitives with which to model a domain of knowledge or discourse" [20]. Ontologies provide a common shared vocabulary to model a domain, defining the types of objects and concepts that exist with their properties and relationships. Ontologies can be classified according to the subject of conceptualization into [21]:
1. general or common ontologies, defining concepts to represent common sense knowledge, reusable across domains;
2. top-level ontologies, defining very general concepts independent of a particular domain such as space, time, object, event, etc., and providing general notions to which all root terms in existing ontologies should be related;
3. domain ontologies, defining concepts within a specific domain and their relationships; the concepts in this type of ontology are usually the specialization of concepts already defined in a top-level ontology;
4. task ontologies, defining concepts related to the execution of a particular task or activity and providing a vocabulary of terms used to solve problems associated with tasks that may or may not belong to the same domain;
5. application ontologies, containing all the definitions needed to model the knowledge required for a particular application.
Recently, we have seen an explosion of interest in ontologies as models to represent human knowledge. Ontologies are now extensively used in applications related to areas such as knowledge management, natural language processing, e-commerce [22], web services [23], intelligent information integration, bioinformatics [24], education, life sciences [25] and medicine [26], and in widely adopted technologies such as the Semantic Web [27]. There are several reasons for this wide range of applications. Ontologies provide a common terminology over a domain, necessary for communication between people and organizations, and also provide the basis for interoperability between systems. They can be used for making the content in information sources explicit and serve as an index to a repository of information [28]. The growing interest in ontologies triggered the development of Ontological Engineering, a novel field concerned with the ontology development process, the ontology life cycle, the methods and methodologies for building ontologies, and the tool suites and languages that support them [29, 30].
Despite the cited advantages, the choice of ontologies and formal representations incurs considerable costs for the retooling and upgrade of resources, and for the training of ontology developers. One serious problem is that differing ontologies may be developed and applied for the representation of the same domain. The mere use of ontologies therefore does not guarantee the elimination of heterogeneity; instead it can raise heterogeneity problems to a higher level. Ontology alignment, or ontology matching [31], a process that determines correspondences between concepts in different ontologies, can help to overcome those problems. In biology the heterogeneity of ontologies represents an emergent issue. In this respect, the OBO Foundry initiative [32] engages developers of science-based ontologies in the pursuit of a set of common principles for ontology development, with the goal of creating a suite of orthogonal interoperable reference ontologies in the biomedical domain.
The use of the word ontology within biology is relatively recent. Initially, computer scientists recognized in biological data a domain in which ontologies were needed in order to solve problems of heterogeneity. The second phase saw the adoption of bio-ontologies by the biological community itself as a means to consistently annotate different features, from genotype (e.g. nucleotide sequences, proteins) to phenotype (e.g. diseases) [33]. Later, with the beginning of genome-scale sequencing projects and the diffusion of high-throughput experiments, the amount of accessible biological data started to grow exponentially. Data are now dispersed throughout several different databases and their interpretation and analysis require sophisticated tools for data management and information processing. Organized in this way, biological information is encapsulated within database schemas and is not easily available to scientists. Instead, knowledge can be better captured and made available to both humans and computers thanks to ontologies. Bio-ontologies are indeed fundamental components in biological data integration and annotation. In the last decade, several groups have been developing controlled vocabularies and descriptors mainly for the annotation of this kind of data. For instance, the Metabolomics Standards Initiative (MSI) ontology working group is developing an ontology to facilitate the consistent annotation of metabolomics experimental data [34]. Besides the well known Gene Ontology [24] there are many other initiatives focused on standardization and ontology development that may be cited, such as MIAME [35] and PRIDE [36]. These are mainly centred on the development of ontologies and bioinformatic tools for biological data annotation. However, only a few projects have been developed for the representation and formalization of the experimental protocols and the automatic operations producing such experimental data.
A formal definition of scientific experimental design, laboratory entities and operations is undoubtedly important, even in the case of manually executed experiments. The development of an ontology of experiments is a fundamental step in the formalization of science, since experimentation is one of the most characteristic features of science.
In this regard, the EXPO ontology of scientific experiments has been developed to formalize generic knowledge about scientific experimental design, methodology and representation of results [37]. The Ontology for Biomedical Investigations (OBI) addresses the need for controlled vocabularies not only for experimental data annotation but also for the representation of investigations in the Biological and Biomedical Sciences [38]. This ontology represents the design of an investigation, the protocols and instrumentation used, the material used, the data generated and the type of analysis performed.
EXACT [39] is an ontology of experimental actions that can be used as a formalism suitable for a structured representation of laboratory protocols. The core of this structured vocabulary is a hierarchical classification of experimental actions based on goals of actions: the goal of separation, the goal of transformation and the goal of combination.
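A goal-based classification of this kind can be pictured as a small is-a hierarchy. The sketch below is only an illustration in plain Python: the three goal classes come from the description above, whereas the leaf action `centrifuge`, its placement under separation, and the class and method names are assumptions rather than content taken from EXACT.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass(frozen=True)
class ActionClass:
    """A node in a hierarchical classification of experimental actions."""
    name: str
    parent: Optional["ActionClass"] = None

    def ancestors(self) -> Iterator["ActionClass"]:
        """Yield the parent, grandparent, ... up to the root."""
        node = self.parent
        while node is not None:
            yield node
            node = node.parent

    def is_a(self, other: "ActionClass") -> bool:
        """True if this class equals `other` or descends from it."""
        return self == other or other in self.ancestors()


# Root and the three goal classes named in the text.
EXPERIMENTAL_ACTION = ActionClass("experimental action")
SEPARATION = ActionClass("separation", EXPERIMENTAL_ACTION)
TRANSFORMATION = ActionClass("transformation", EXPERIMENTAL_ACTION)
COMBINATION = ActionClass("combination", EXPERIMENTAL_ACTION)

# Hypothetical leaf action, placed under a goal for illustration only.
CENTRIFUGE = ActionClass("centrifuge", SEPARATION)
```

A query such as `CENTRIFUGE.is_a(SEPARATION)` then answers, by walking up the hierarchy, which goal an action serves.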
Exploiting the properties of EXACT in representing protocols, we expand its formalism and then combine it with that of workflows, to define a strategy that can be more expressive and efficient in protocol formalization.
Workflows
In the workflow context, a process can be considered as the set of activities performed by different entities and their execution ordering through different constructors, which permit flow of execution control (e.g. sequence, choice, parallelism and join synchronization). An elementary activity is an atomic piece of work [40].
A workflow is therefore the structured definition of a process used for the automatic management of particular activities. The formalization of a process (workflow schema) involves the definition of activities, the specification of their order of execution (i.e. the routing or control flow) and of the responsible actors. Other features should be taken into account too, e.g. the data flow [40] or the various ways in which resources are represented and utilized in workflows [41].
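The elements of a workflow schema listed above (activities, their order of execution and the responsible actors) can be sketched as a small dependency graph. This is an illustrative model only, not the data model of any particular workflow system: the class names, activity names and actors are assumptions, and only sequence, parallelism and join synchronization are covered (choice is omitted). It relies on `graphlib` from the Python standard library (Python 3.9 or later).

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class Activity:
    name: str
    actor: str  # responsible actor, e.g. a person or a machine


@dataclass
class WorkflowSchema:
    activities: dict = field(default_factory=dict)    # name -> Activity
    predecessors: dict = field(default_factory=dict)  # name -> names that must finish first

    def add(self, activity: Activity, after=()):
        """Register an activity that may start only once all of `after` are done."""
        for name in after:
            if name not in self.activities:
                raise KeyError(f"unknown predecessor: {name}")
        self.activities[activity.name] = activity
        self.predecessors[activity.name] = set(after)

    def execution_order(self):
        """Return one valid ordering that respects the control flow."""
        return list(TopologicalSorter(self.predecessors).static_order())


wf = WorkflowSchema()
wf.add(Activity("prepare sample", "technician"))
wf.add(Activity("centrifuge", "robot"), after=["prepare sample"])  # sequence
wf.add(Activity("prepare buffer", "technician"))                   # may run in parallel
wf.add(Activity("resuspend pellet", "technician"),
       after=["centrifuge", "prepare buffer"])                     # join synchronization
order = wf.execution_order()
```

Because "centrifuge" and "prepare buffer" are not ordered with respect to each other, more than one ordering is valid; the schema only guarantees that every activity appears after its predecessors.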
There are three well established formalisms applied for the specification/modelling of processes: Business Process Execution Language (BPEL), Business Process Modelling Notation (BPMN), XML Process Definition Language (XPDL).
BPEL [42] is an execution language based on XML specification for the formal description of business processes based on Web Services.
BPMN [43] is a graphical notation based on intuitive flowcharts for the definition of business processes. Originating from the Business Process Management Initiative, in 2005 it was merged into OMG [44] and in 2007 version 1.1 became a standard.
XPDL [45] is a markup language created to ensure interoperability among different workflow management tools in order to handle workflow processes. It was designed to permit the exchange of process definitions, addressing both the graphical and the semantic notations of the relevant workflow. Born as a support for serialization of BPMN constructs, it also incorporates information relating to the graphical representation (e.g. the position of blocks in the workflow). XPDL was developed by the Workflow Management Coalition (WfMC) [46], a consortium formed to define standards for the interoperability of workflow management systems.
In the last few years the interest in workflow development has grown considerably in the scientific community [47]. Scientific workflows can be considered as the executable description of scientific processes [48]. Similar in nature to business workflows, they have the distinct characteristic of operating on large amounts of heterogeneous data. In particular, they are generally data-flow oriented instead of being event-based, and very versatile in composing flows of execution. In bioinformatics, in particular, workflows are extremely valuable for programming the steps of in silico experiments in a visual, intuitive manner. However, workflows are still not commonly adopted in the formalization of protocols for biological laboratory experiments.
There are several available tools for workflow design and enactment [49], for instance JPEd [11], an open-source visual editor for general-purpose workflows. Taverna [50], developed by the myGrid project, is the workflow platform most commonly used for the systematic analysis of vast amounts of data, but it does not allow description of laboratory experimental procedures. Taverna workflows can be shared among the scientific community thanks to Web 2.0 initiatives like myExperiment [51]. This social web site enables scientists to publish their workflows and, in addition, to execute, reuse and share the workflows of other groups. In this way myExperiment contributes to reducing time-to-experiment, sharing knowledge and expertise, and avoiding reinvention [52].
The proposed approach
We propose to combine ontologies and workflows for formalizing protocols used in biological laboratories. Workflows permit an intuitive representation of protocols, allowing the synchronization of different executors. Our workflow specifications can be stored and shared using the XPDL standard interchange language [45]. By means of ontologies, laboratory knowledge can be directly embedded into the workflow model and shared using the standard Web Ontology Language (OWL) [53]. In this manner the precise constraints defined in the ontologies are transferred to the protocol building blocks.
To allow the integration of workflow and ontologies we have developed the COW tool, an add-on for the JPEd workflow editor that permits an easy and intuitive design of "ontologized" workflows. This allows a formal representation of laboratory protocols dealing with specific equipment and operations.
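The general idea of an "ontologized" building block can be illustrated with a short sketch in which a workflow step carries a reference to an ontology concept and inherits its constraints. This is a plain-Python stand-in under stated assumptions: it does not reproduce COW's data model, OWL or XPDL, and the names `Concept`, `Step`, `violations` and the numeric bounds are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """A hypothetical ontology concept with numeric constraints on its parameters."""
    name: str
    constraints: dict  # parameter name -> (minimum, maximum)


@dataclass
class Step:
    """A workflow building block annotated with an ontology concept."""
    label: str
    concept: Concept
    parameters: dict = field(default_factory=dict)

    def violations(self):
        """List the constraints of the concept that this step does not satisfy."""
        problems = []
        for param, (low, high) in self.concept.constraints.items():
            if param not in self.parameters:
                problems.append(f"{self.label}: missing '{param}'")
            elif not low <= self.parameters[param] <= high:
                problems.append(f"{self.label}: '{param}' outside [{low}, {high}]")
        return problems


# Illustrative concept; the bounds are assumptions, not taken from any ontology.
INCUBATION = Concept("incubation", {"duration_min": (1, 60), "temperature_c": (4, 95)})

ok = Step("incubate slides", INCUBATION, {"duration_min": 5, "temperature_c": 37})
bad = Step("incubate slides", INCUBATION, {"duration_min": 120})
```

Here `ok.violations()` returns an empty list, while `bad.violations()` reports the out-of-range duration and the missing temperature, which is the sense in which constraints defined once in the ontology carry over to every step that uses the concept.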