Resources for Developers

This section contains all the information needed by developers who want to collaborate with the Open Data Hub team (e.g., developers who send pull requests to the Open Data Hub repositories) or who are contracted to write code for the Open Data Hub project (Open Data Hub Core Hackers) or its apps.

Guidelines for Developers

Open Data Hub is a collection of software, databases, and services coordinated and hosted by IDM Südtirol / Alto Adige. Currently, Open Data Hub systems are related to mobility and tourism; in the future, the Open Data Hub might diversify into more fields.

Companies and developers contributing to the Open Data Hub must follow the guidelines listed in these documents as closely as possible.

The aim of the Open Data Hub Developer’s Guidelines (“The Guidelines”) is to simplify the hosting and maintenance of the software, databases, and services by the Open Data Hub developers and maintainers at IDM (“the Open Data Hub team”).

The Guidelines describe the conventions to which a developer must adhere in order to become an active Open Data Hub developer or to have their work incorporated into the Open Data Hub. They are split into two parts:

  • Platform Guidelines explain the preferred programming languages, how to expose data after you have manipulated them, the use of third-party libraries or plugins, and so on.
  • Database Guidelines clarify how to design a database that shall become part of the Open Data Hub platform.

Both of them are summarised in the remainder of this section and can be found in their full versions on the pages Platform Guidelines - Full Version and Database Guidelines - Full Version, respectively.

Platform Guidelines - Bignami Version

The Platform Guidelines contain the software and programming language requirements, coding conventions, and directions for development. This section contains only the most important points.

Please check the full version of this document at Platform Guidelines - Full Version if you want to know more details, if you have any doubts, or if what you are looking for is not mentioned in this summary.

  • The programming language is Java, in its latest or second-to-last version.
  • The source code must be documented according to the Javadoc style guide and tags.
  • Java components of Open Data Hub can be developed as libraries, standalone applications, or server applications running in Apache Tomcat.
  • The source code is built nightly; the build configuration should be provided in Ant or Maven (preferred), as a Makefile, or as a shell script.
  • Third-party libraries can be used, provided they are well established, FOSS-licensed, and do not overlap in functionality. This also applies to third-party libraries used in applications developed in other languages.
  • Front-end applications can be deployed in JavaScript, version ECMAScript 2015, and must support modern browsers.
  • Node.js can be used to deploy headless or server applications.
  • Web front ends use the latest HTML and CSS versions, must work on mobile devices (responsive design), and should implement basic accessibility principles.
  • JSON must be used as the exchange format, while XML is welcome as well.
  • The latest or second-to-last version of Apache Tomcat is used to run server applications; only API/REST end points have direct access to the database server.
  • There is no file system persistence; everything must be stored in the DB. The JDBC data source and passwords should be stored in environment variables.
  • Pay attention to RAM usage; applications will undergo load testing.
  • The PostgreSQL RDBMS is used, but not in its most recent release (expect to use a version 2-3 releases behind the latest); the PostGIS spatial extension is required as well.
  • Developers will have an unprivileged role to access the DB and must follow best practices to query the DB from Java/JavaScript.

Database Guidelines - Bignami Version

The Database Guidelines contain the database design and database programming principles along with software version requirements. This section contains only the most important points.

Please check the full version of this document at Database Guidelines - Full Version if you want to know more details, if you have any doubts, or if what you are looking for is not mentioned in this summary.

  • The database can be designed with one of the Relational Model, Object-Relational Mapping (ORM), or Semi-structured Data methodologies.
  • A database designed with any of these methodologies must be shipped with the DDL, i.e., schema files containing the CREATE statements.
  • Each database must include a version table and indices on tables.
  • All (SQL) source code must be well-documented, with in-line comments and higher level documentation.
  • Use standard database features - sequences, primary and foreign keys, constraints (unique, check, not null), default values, views, and so on.
  • Separate business logic from database design; avoid stored procedures as much as possible.
  • Small procedures and functions, if needed, must be written in PL/PgSQL.
  • Do not use foreign data wrappers.
  • Consider using declarative partitioning for large tables - and contact Open Data Hub team beforehand to discuss it.
  • Always use UTF8 character encoding and do not override it.
  • The default collation is en_US, which works well for German and Italian as well.
  • Never use the money type; use numeric instead.
  • Dates and time stamps must be stored so as to avoid ambiguity. Never store them as text, but rather use their dedicated data types, date and timestamp with time zone. Unix timestamps are accepted as well.
  • When using or manipulating JSON data, always follow the ISO 8601 standard for dates and timestamps.

Platform Guidelines - Full Version

Changelog

  • 2018-05-28 version 1.0
  • 2018-03-30 version 1.0-beta

This document represents Part 1 of the guidelines and presents the preferred programming languages, databases, and protocols to be used, data exchange and exposition methods, coding conventions, and regulates the use of third-party libraries.

There are scenarios where an exemption from the guidelines is acceptable. The following is a non-exhaustive list of such scenarios.

  1. Use of foreign technologies. The development of an Open Data Hub component requires the use of platforms, languages, or generally technologies that are different from the ones listed in the guidelines. An example might be a component that depends on an already developed custom library written in a programming language not listed in the guidelines.
  2. Use of technologies that are not mentioned in the guidelines. Future Open Data Hub components might require technologies that are not listed at all in the guidelines. An example is a component that must be hosted on specific hardware needed for machine learning platforms.

An Open Data Hub contributor who runs into such a scenario must contact the Open Data Hub team to discuss that specific scenario. If the exemption is reasonable and can be motivated, the Open Data Hub team will agree and allow it. To avoid misunderstandings, contributors should expect to receive a written statement about such a decision.

Note

If you cannot find any answer to your question or doubt in this document, please contact the Open Data Hub team or open an issue in the GitHub repository of this document.

Platforms and Architectural Considerations

Java server applications running in Apache Tomcat

Apache Tomcat is a well-established, lightweight FOSS web server that implements, among others, the Java Servlet specification.

The Open Data Hub team generally uses the latest or second-to-last release of Tomcat to run Java server applications in the previously mentioned contexts:

  • API/REST end points.
  • Web applications.

The desired design is that only API/REST end points directly access the database server, while web applications just talk to the API/REST end points.

Automatic Deployment

Each Tomcat instance normally runs a few web applications, hence expect an Open Data Hub web application’s WAR file to be bundled together with other WAR files to run on a given instance.

The automatic build system takes care of this bundling and deployment. It is therefore very important that all WARs can be built automatically, as mentioned in the section about Java.

No File System Persistence

Currently, the Open Data Hub team uses Amazon Web Services for Tomcat hosting, in particular the managed service known as Elastic Beanstalk. While there is no hard dependency on this provider (it could be changed at any point in the future), the architectural design of Elastic Beanstalk has partly shaped the engineering choices of the Open Data Hub team in the design of its web applications.

First and foremost, servers are considered volatile. This means an Open Data Hub component running in Tomcat cannot expect to see a persistent file system!

All web applications must therefore be developed with the database as the only persistent storage layer. This architectural choice has a few advantages:

  • Web applications can be distributed over more than one web server (horizontal scaling), increasing availability and performance.
  • Backup and disaster recovery are very much simplified - a failing instance can just be replaced by a new instance and the application can be deployed again.

Developers must pay particular attention to this point: there is no persistent file system. Hence no changeable configuration files, no application-specific log files. Everything is stored in the database.

Data Source

One subtle point is the question “Where are the JDBC data source and password stored?”. They cannot be stored in a file and must not be stored in the source code or context files. The recommended way to store this information is in Java environment properties.

The system will set these variables when launching Tomcat:

JDBC_CONNECTION_DRIVER=org.postgresql.Driver
JDBC_CONNECTION_STRING=jdbc:postgresql://host:5432/db?user=username&password=secret

The developer can then read them with:

System.getProperty("JDBC_CONNECTION_DRIVER");
System.getProperty("JDBC_CONNECTION_STRING");
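
A minimal sketch (not the mandated approach) of how a server application could use these properties to obtain a JDBC connection, assuming the PostgreSQL JDBC driver is on the classpath:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class DataSourceExample {

    // Opens a connection using the properties set by the platform at Tomcat launch.
    public static Connection openConnection() throws ClassNotFoundException, SQLException {
        // load the driver class named in JDBC_CONNECTION_DRIVER
        Class.forName(System.getProperty("JDBC_CONNECTION_DRIVER"));
        // the connection string already contains host, port, database, user and password
        return DriverManager.getConnection(System.getProperty("JDBC_CONNECTION_STRING"));
    }
}
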
RAM Usage

The Open Data Hub encompasses a considerable number of web applications that are bundled together to run on a few Tomcat server instances. Contrary to popular belief, RAM is not an infinite resource. Contributors are kindly reminded to pay attention to the RAM usage of their web applications, since load testing is expected.

Java standalone applications, running headless

Besides web applications running in Tomcat, the Open Data Hub also has headless standalone applications written in Java or JavaScript/Node.js.

These are meant for special use cases, such as compute-intensive jobs or batch processing, made upon request.

Almost everything said in the previous section about Tomcat applies here as well.

Again, the preferred way to run these applications is in an environment where servers are volatile and the only persistence layer is the database.

PostgreSQL

PostgreSQL is one of the most established RDBMSs on the market and is generally described as by far the most advanced FOSS RDBMS; it has therefore been chosen as the primary database system for the Open Data Hub.

There is a new major release of PostgreSQL per year and each release is supported for 5 years, according to the versioning policy. Contrary to the case of the other products mentioned in these guidelines, the Open Data Hub team generally will not run the latest or even previous version of PostgreSQL. Expect the version available for Open Data Hub to lag about 2-3 years behind the latest available release.

Extensions

Most, if not all, of the extensions distributed with PostgreSQL can be expected to be available; the third-party spatial query extension PostGIS is also available.

Other extensions are very likely not available, so ask the Open Data Hub team if in doubt.

Accessing the Database

Application developers will get one or more unprivileged database roles to access the database. Access will be done via JDBC when using Java, or via any of the available PostgreSQL modules for Node.js when using JavaScript.

The data source strings must be parsed from the environment variables (see section Java server applications running in Apache Tomcat).

The maximum number of concurrent database sessions will generally be limited per role; therefore, each developer must clarify with the Open Data Hub team what an acceptable number is, depending on the application.

Since PostgreSQL will refuse a connection if that number is exceeded, developers must take this number into account, whether they configure a connection pool or not.

Open Data Hub databases generally are configured to accept connections only from the known hosts where the application servers are deployed.

Contributors must follow well-known best practices when querying the database from Java or JavaScript (a minimal JDBC sketch follows this list):

  • When processing large datasets, consider setting smaller values of the fetch size (or equivalent parameter) to avoid buffering huge result sets in memory and running out of RAM.
  • When performing a huge number of DML statements, consider switching off any client-side autocommit feature and bundle the statements into transactions instead.
  • Do not open transactions without closing them; in other words, do not leave sessions in transaction!
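
The following is a minimal JDBC sketch of these practices; the table, columns, and fetch size are hypothetical examples and must be adapted to the actual application:

import java.sql.*;

public class QueryBestPractices {

    public static void run() throws SQLException {
        try (Connection conn = DriverManager.getConnection(System.getProperty("JDBC_CONNECTION_STRING"))) {
            conn.setAutoCommit(false); // switch off client-side autocommit

            // With autocommit off, a fetch size > 0 lets the PostgreSQL driver use a cursor
            // instead of buffering the whole result set in memory.
            try (Statement sel = conn.createStatement()) {
                sel.setFetchSize(500);
                try (ResultSet rs = sel.executeQuery("select id, title from doc")) {
                    while (rs.next()) {
                        System.out.println(rs.getInt("id") + " " + rs.getString("title"));
                    }
                }
            }

            // Bundle a large number of DML statements into a single transaction.
            try (PreparedStatement ins = conn.prepareStatement("insert into doc(title) values (?)")) {
                for (int i = 0; i < 1000; i++) {
                    ins.setString(1, "title " + i);
                    ins.addBatch();
                }
                ins.executeBatch();
            }

            conn.commit(); // close the transaction - do not leave the session in transaction
        }
    }
}
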
Database Design and Usage

This section has been moved into its own document, Database Guidelines - Full Version.

Database Guidelines - Full Version

Changelog

  • 2018-05-28 version 1.0

This document represents Part 2 of the Open Data Hub Developer’s Guidelines and clarifies the database design criteria for developers who contribute their own database designs to the Open Data Hub platform.

Basic information about the PostgreSQL versions, the PostgreSQL extensions, and how to access PostgreSQL from Java or JavaScript, intended for developers who contribute code that just uses an existing database, is explained in the Platform Guidelines - Full Version document as well. Please refer to that document for a general introduction to the scope of the present guidelines.

Database design methodology

The Open Data Hub team is generally agnostic about database design and acknowledges the existence of different design and development methodologies.

Specifically, the following methodologies are well known and acceptable:

  1. Relational Model. The data schema is implemented using normalized relations with standard SQL concepts (schemas, tables, columns and keys). The CREATE statements are written by the developer.
  2. Object-Relational Mapping (ORM). The underlying data schema is based on the relational model, but the CREATE statements are generated by an ORM framework that automatically maps entities to relations.
  3. Semi-structured Data. Entities are stored in a semi-structured format. For the Open Data Hub the preferred format is JSON. Specifically, the recommended design is to map each entity to its own table. The table should have at least two columns: one traditional ID column and one JSON data column. The (simple) CREATE statements are written by the developer. The JSON data column must use the PostgreSQL native data type jsonb (see binary stored JSON in PostgreSQL documentation).

PostgreSQL supports all three methodologies well. It is also possible to have a hybrid design mixing methodologies 1 and 3.
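
For illustration, a minimal sketch of methodology 3, assuming a hypothetical entity called event:

create sequence event_seq;

-- one table per entity: a traditional ID column plus one jsonb data column
create table event (
    id      int default nextval('event_seq'),
    data    jsonb not null,
    primary key(id)
);

comment on table event is 'stores event entities as semi-structured JSON documents';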

A developer contributing a database design to the Open Data Hub must provide the DDL, i.e., the schema files containing the CREATE statements.

Like all source code files, the schema files must be commented in-line and accompanied by additional, higher level documentation.

Besides source code file comments, database objects must also be commented with the SQL comment command (see Sample Code 1 below).

Updates must be provided in the form of ALTER statements, so the modifications can be easily applied to existing databases (see Sample Code 2 below).

All database designs should contain a version table, where the version is stored (and updated with each update).

The Open Data Hub team likes to stress this point: do not just commit database schema dumps, but rather treat SQL-DDL files as source code and cleanly distinguish the initial creation and later updates.

Sample Code 1: A DDL source file called foo.sql

-- foo.sql
-- a document with appendices
--
-- changelog:
-- version 1.0
--
-- copyright, author etc.

create sequence foo_seq;

create table doc (
    id      int default nextval('foo_seq'),
    title   text not null,
    body    text,
    primary key(id)
);

comment on table doc is 'stores foo documents';

create table appendix (
    id      int default nextval('foo_seq'),
    section char(1) not null,
    body    text,
    doc_id  int not null,
    primary key(id),
    foreign key (doc_id) references doc(id)
);

comment on table appendix is 'stores appendices to foo documents';

create table foo_version (
    version varchar not null
);

insert into foo_version values ('1.0');

Sample Code 2: Update to schema of foo.sql, version 2.0:

-- foo.sql
-- a document with appendices
--
-- changelog:
-- version 2.0 - added a field
-- version 1.0
--
-- copyright, author etc.

BEGIN;

alter table doc add column publication_date date default current_date;

update foo_version set version = '2.0';

COMMIT;

The explicit transaction (BEGIN - COMMIT) will make sure the DDL update is applied cleanly or not at all. Note that DDL statements in PostgreSQL are transactional.

If methodology 2 (ORM) is chosen, the contributor should provide the cleanest DDL output the framework provides.

Contributors can expect their database design to be stored into a schema whose name is determined by the Open Data Hub team and executed as a non-privileged user account that has the given schema in its default search_path (see DDL schema path in PostgreSQL documentation).

Unless there is a specific reason, contributed designs must use only a single schema without referring to it by its explicit name, because the name will be determined by the search_path.
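
For illustration, referring to the doc table of Sample Code 1 (odh_foo is a hypothetical schema name chosen by the Open Data Hub team, not by the contributor):

-- good: rely on the search_path configured for the unprivileged account
create table doc (
    id      int default nextval('foo_seq'),
    title   text not null,
    body    text,
    primary key(id)
);

-- avoid: hard-coding the schema name
-- create table odh_foo.doc ( ... );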

Contributors are invited to make good use of standard database features, including, but not limited to:

  • Sequences.
  • Primary and foreign keys.
  • Unique constraints.
  • Check constraints.
  • Not null constraints.
  • Default values.
  • Views.

Stored procedures and functions, foreign data wrappers

The Open Data Hub team would like to avoid stored procedures and functions as far as possible. Business logic should be implemented in the middle tier, not in the database system.

Hence, the general rule is that database designs submitted to the Open Data Hub must not contain business logic operations.

However, (small) utility procedures and functions, especially with respect to triggers, are allowed. When used, these procedures and functions must be written in PL/PgSQL. Other server-side languages, even the trusted ones, are neither allowed, nor can they be expected to be available.

An example of such an allowed procedure is an audit trigger that, for any change made to table A, generates a log entry that is stored in table B.
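
A minimal PL/PgSQL sketch of such an audit trigger, reusing the doc table of Sample Code 1 (the names doc_audit, doc_audit_fn, and doc_audit_trg are hypothetical):

-- "table B": the audit log
create table doc_audit (
    id         int default nextval('foo_seq'),
    doc_id     int not null,
    operation  text not null,
    changed_at timestamp with time zone not null default now(),
    primary key(id)
);

-- small utility function, written in PL/PgSQL
create function doc_audit_fn() returns trigger as $$
begin
    if tg_op = 'DELETE' then
        insert into doc_audit(doc_id, operation) values (old.id, tg_op);
        return old;
    else
        insert into doc_audit(doc_id, operation) values (new.id, tg_op);
        return new;
    end if;
end;
$$ language plpgsql;

create trigger doc_audit_trg
    after insert or update or delete on doc
    for each row execute procedure doc_audit_fn();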

Foreign data wrappers (SQL/MED) must not be used.

Indices and Partitioning

The submitted database designs must include creation of indices on tables.

Of course, the Open Data Hub team will monitor database performance and might be able to add indices at a later time. However, not anticipating obvious index candidates is considered a bug.

The database design contributor knows best what tables and what columns will benefit from indices, when the number of records grows.

In particular, if methodology 3 (JSON) is chosen, PostgreSQL provides specialized GIN (Generalized Inverted Index) indices to index the jsonb data type.
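
For example, assuming the hypothetical event table sketched in the Database design methodology section, a GIN index on the jsonb column could look like this:

-- supports, among others, the jsonb containment operator @>
create index event_data_gin on event using gin (data);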

If the contributor anticipates designs with large tables (say more than 100M records or more than 5 GB on disk) and expects queries needing to sequentially scan those tables, declarative partitioning should be considered. The contributor must then contact the Open Data Hub team to agree on a declarative partitioning scheme in advance.
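
As a rough sketch only - the actual scheme must be agreed upon with the Open Data Hub team, and declarative partitioning requires PostgreSQL 10 or later - range partitioning of a hypothetical measurement table by time could look like this:

create table measurement (
    station_id int not null,
    ts         timestamp with time zone not null,
    value      numeric
) partition by range (ts);

create table measurement_2018 partition of measurement
    for values from ('2018-01-01') to ('2019-01-01');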

Encoding, collation and localization

All Open Data Hub PostgreSQL databases use the UTF8 character encoding as default encoding and this must not be overridden by a database design contributor.

The Open Data Hub team wishes to avoid any character encoding issues by using UTF8 for everything.

The default collation is en_US. For PostgreSQL running on Linux this collation already behaves reasonably for German and Italian:

select * from t order by s collate "en_US";
 s
---
 A
 À
 Ä
 B
(4 rows)

A contributor is free to add a custom collation such as de_DE or it_IT, either at the DDL level or the query level (see PostgreSQL documentation on collation), although there is most likely no need to apply other collations.

A database design must not use the money type. Currency amounts must be stored in fields of type numeric and the currency must be stored separately.
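
For example (the table and column names are hypothetical):

create table price (
    id       int primary key,
    amount   numeric(12,2) not null,        -- never money
    currency char(3) not null default 'EUR' -- e.g. an ISO 4217 currency code
);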

One important aspect concerns dates and timestamps.

Since the Open Data Hub applications span multiple regions and time zones, it is very important to be precise about date and time formats and time zone information.

Dates must be stored in the appropriate date data type. Dates stored in this data type will be automatically converted into the client's native format when queried. Never store dates as text because this creates ambiguity. For example, what date does the string 10-07-2018 represent? Is it the seventh of October 2018 or the tenth of July 2018?

The same holds true for timestamps, which must be stored in the appropriate timestamp data type. Besides avoiding format ambiguities, this data type also includes the time zone.

Note

PostgreSQL supports also a timestamp without time zone data type, according to the SQL standard. However, this data type must not be used as it does not store the vital time zone information.

Here is the output of two queries executed almost at the same time on two PostgreSQL servers running in different time zones.

This is UTC (no daylight saving).

# select now();
            now
-------------------------------
 2018-05-28 00:28:25.963945+00
(1 row)

And this is CET (with daylight saving), 2 hours ahead of UTC:

# select now();
            now
-------------------------------
2018-05-28 02:28:27.121242+02
(1 row)

You can see that these two queries were executed (almost) at the same time thanks to the time zone information (+00 vs. +02). Without the time zone information, the two time stamps would appear to be separated by two hours.

Note

When using the date and timestamp data types there is no format issue at all, as the PostgreSQL client libraries automatically convert from and to the client's native format. For example, a Java Date object is automatically converted to an SQL date value.

Sometimes developers need to convert to and from text. In case a contributing developer wishes to do this using PostgreSQL functions, they must use functions to_date() and to_char() (see PostgreSQL documentation on function formatting).

For example:

-- insert into date field d converting from German text:
# insert into dates (d) values (to_date('28.5.2018', 'DD.MM.YYYY'));

-- select date field d and convert to German text:
# select to_char(d, 'DD.MM.YYYY') from dates;
  to_char
------------
 28.05.2018
(1 row)

Sometimes timestamps are stored as numbers, the so-called Unix timestamp (see Unix time on Wikipedia).

This is also acceptable, as the Unix time stamp always follows UTC and is therefore unambiguous.
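
If such a value needs to be converted inside the database, PostgreSQL offers the to_timestamp() function; for example, on a server running in UTC:

# select to_timestamp(1527467305);
      to_timestamp
------------------------
 2018-05-28 00:28:25+00
(1 row)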

For JSON data, contributors must make sure that the textual representation of dates and timestamps follows the ISO 8601 standard (see more on Wikipedia). Examples:

  • "ts":"2018-05-28T00:54:28.025Z"
  • "d":"2018-05-28"

PostgreSQL accepts these strings as inputs for timestamp and date types even as text (there is an implicit type cast).

Also note that JavaScript has a Date.prototype.toISOString() method.

Development, Testing, and Production Environments

Note

Information in this section is still provisional!

Figure 3 shows the various environments which compose the whole Open Data Hub development process.

_images/DTP.png

Figure 3 Diagram showing the development, testing, and production environments in the Open Data Hub project.

The right-hand side shows the internal structure of the development process, while the left-hand side shows how external, potentially worldwide, collaborators can contribute to and interact with the Open Data Hub team.

Internally, two distinct and separate environments exist: testing and production. The former is updated daily, while the latter only when the expected result (be it a new feature, a bug fix, or anything else) is ready to be published.

Both environments are updated with Continuous Integration using Jenkins, which monitors the git repositories and updates the environments.

External developers can push their own code to the git repositories (provided they have been granted the permission to do so) and expect their work to be reviewed and tested by the Open Data Hub team.

GitHub Quick Documentation for Contributors

This section guides you through setting up, on your local workstation, the (forked) git repositories needed to contribute to the Open Data Hub project, along with some troubleshooting concerning pull requests and merge conflicts. For more detailed help, please refer to the online GitHub help at https://help.github.com/.

Prerequisites

In the following documentation, some example names are used. Please replace them with your own names:

  • You need an account on Github to be able to fork projects and contribute to the Open Data Hub project.
  • Replace your-username with your username on GitHub.
  • Replace feature-branch with the branch name you will develop in your forked version.

Project Checkout

Before starting the development, you need to fork the original (upstream) repository.

  1. Navigate to the repository on GitHub, e.g., https://github.com/noi-techpark/bdp-core.

  2. Create a fork of the repository by clicking on the Fork button. If you are not logged in, you will be asked for a GitHub username and password.

    _images/fork.png

    Figure 4 Fork the repository.

  3. Navigate to your forked repository on GitHub, e.g., https://github.com/your-username/bdp-core.

  4. Check out the forked repository on your local machine, using the link that appears in your repository (see Figure 5):

    git clone git@github.com:your-username/bdp-core.git
    
    _images/checkout.png

    Figure 5 Clone the repository.

Create a pull request

In order to have your contribution accepted into the Open Data Hub code base, you need to follow these steps.

  1. Checkout the development branch:

    git checkout development
    
  2. Create a new branch from the development branch locally on your machine:

    git checkout -b feature-branch
    
  3. Make some changes to the code and commit them:

    git add -A
    git commit -m "Some commit message"
    
  4. Push the new branch to GitHub:

    git push --set-upstream origin feature-branch
    
  5. Navigate to your feature branch on GitHub (https://github.com/your-username/bdp-core/pull/new/feature-branch) to create a new pull request (see Figure 6).

    _images/create-pull-request-development.png

    Figure 6 Create a pull request.

    You can add a description as well, to explain your changes.

  6. Commit and push any changes of the pull request to this new branch.

  7. For every commit, the continuous integration pipeline will execute the tests and display the results in the pull request, as shown in Figure 7.

    _images/show-pull-request-ok-development.png

    Figure 7 Show outcome of a pull request.

  8. In addition, the detailed logs can be viewed under https://ci.opendatahub.bz.it.

Syncing a Fork

Your forked repository does not receive the updates of the original repository automatically. To sync, for example, the development branch of the two repositories and keep the forked repository up to date with all the latest changes of the development branch from the original repository, the following steps have to be performed.

Before you can sync your fork with the original repository (the upstream repository), you must configure a remote that points to the upstream repository in Git. A more detailed description of the following steps can be found in the online GitHub help: https://help.github.com/articles/configuring-a-remote-for-a-fork/.

  1. List the currently configured remote repository for your fork.

    git remote -v
    
  2. Specify a new remote upstream repository that will be synced with the fork.

    git remote add upstream https://github.com/noi-techpark/bdp-core.git
    
  3. Verify the new upstream repository you’ve specified for your fork.

    git remote -v
    

You need to sync a fork of a repository to keep it up to date with the original (upstream) repository. A more detailed description of the following steps can be found in the online GitHub help: https://help.github.com/articles/syncing-a-fork/.

  1. Fetch the branches and their respective commits from the upstream repository. Commits to development will be stored in the local branch upstream/development.

    git fetch upstream
    
  2. Check out your fork’s local development branch.

    git checkout development
    
  3. Merge the changes from upstream/development into your local development branch. This brings your fork’s development branch into sync with the upstream repository, without losing your local changes.

    git merge upstream/development
    

Resolving Merge Conflicts

When creating and working on a pull request, it can happen that the destination branch of the original repository changes. These changes could result in merge conflicts when pulling your code, as shown in Figure 8.

_images/merge-conflicts-conflicts-development.png

Figure 8 A Merge Conflict.

To resolve merge conflicts, the following steps must be performed.

  1. Sync your forked repository and make sure your local destination (development) branch is up to date with the original (upstream) repository branch.

  2. Check out your feature branch.

    git checkout feature-branch
    
  3. Merge the changes of the development branch into the feature branch.

    git merge development
    

    The command will output the files with merge conflicts. See sample output in Figure 9.

    _images/merge-conflicts-output-development.png

    Figure 9 Merge conflicts output.

  4. Go to the files listed in the previous output and resolve all merge conflicts. The conflicts in the files begin with <<<<<<< and end with >>>>>>>. The ======= separates the two versions.

    _images/merge-conflicts-solving-development.png

    Figure 10 Solving a merge conflict.

    You can resolve a conflict by simply deleting one of the two versions of the code and the inserted helper lines beginning with <<<<<<<, =======, and >>>>>>>.

    If neither of the two versions is completely correct, you can delete the conflicted block entirely and write your own code to resolve the conflict.

  5. Add all resolved files to the index, commit the changes and push the changes to the server.

    git add -A
    git commit
    git push
    
  6. After resolving the merge conflicts, the pull request can be accepted.

    _images/merge-conflicts-resolved-development.png

    Figure 11 A solved merge conflict.

A more detailed description can be found in the online Github help: https://help.github.com/articles/resolving-a-merge-conflict-using-the-command-line/.