ETL - Extract-Transform-Load hub for data aggregation

The ETL hub (extract, transform, load) is relevant when it comes to aggregating data from various sources.

ETL describes the tool-chain and process to extract mapping data from various partners' databases, transform it into a format that another application can process, and load it into the aggregation system.

There are basically 3 possibilities (that we have found so far)
to produce an ETL serving our purpose of map aggregation:

1.1. Extract by scraping based on heuristics + manually written adapters customized to the partners' database structure.

  • Heuristics analyse the content and scrape the relevant data; manually written adapters fit the database structure of the partner commoning the data.
  • That data then gets copied into a queue and uploaded at regular (e.g. daily) intervals.
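
To make option 1.1 a bit more concrete, here is a minimal sketch of what "one adapter per partner, plus a queue" could look like. Everything in it is an assumption for illustration: the partner names, their field names (`title`, `coords`, ...) and the common schema (`name`, `lat`, `lon`, `source`) are made up, and the scraping step itself is left out.

```python
from collections import deque

# Hypothetical common schema: name, lat, lon, source.

def adapt_partner_a(row):
    """Adapter for an imagined partner whose rows use 'title', 'latitude', 'longitude'."""
    return {
        "name": row["title"],
        "lat": float(row["latitude"]),
        "lon": float(row["longitude"]),
        "source": "partner_a",
    }

def adapt_partner_b(row):
    """Adapter for an imagined partner that stores coordinates as one 'lat,lon' string."""
    lat, lon = (float(part) for part in row["coords"].split(","))
    return {"name": row["label"], "lat": lat, "lon": lon, "source": "partner_b"}

# One manually written adapter per partner database structure.
ADAPTERS = {"partner_a": adapt_partner_a, "partner_b": adapt_partner_b}

upload_queue = deque()

def enqueue(partner, rows):
    """Transform a partner's rows with that partner's adapter and queue the results."""
    adapter = ADAPTERS[partner]
    for row in rows:
        upload_queue.append(adapter(row))

def flush(load):
    """Meant to run once per interval (e.g. daily): pass everything queued to `load`."""
    batch = list(upload_queue)
    upload_queue.clear()
    load(batch)
    return len(batch)
```

The cost of this option shows up in `ADAPTERS`: every new partner with a different database structure means another hand-written function.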

1.2. Extract by scraping based on heuristics + a standard adapter that partners can structure their database for.

  • Heuristics analyse the content and scrape the relevant data; we build one (or several?) standard adapter(s).
  • Partners find their own way to build, structure or restructure their database to fit.

2. We convince partners to build their data in an interoperable way (GeoJSON).

  • As the data is stored in a standard format in the first place, it does not need to be transformed.
  • Data is available for aggregation and exchange right away.
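
For illustration, this is roughly what a single point of interest looks like as GeoJSON, built here with plain Python. The helper names and the example properties (`name`, `license`) are assumptions, not an agreed schema; the one fixed rule from the GeoJSON format is that coordinates are ordered longitude first, then latitude.

```python
import json

def poi_to_feature(name, lat, lon, **properties):
    """Build a GeoJSON Feature for one point of interest.

    GeoJSON positions are ordered [longitude, latitude].
    """
    return {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": {"name": name, **properties},
    }

def to_feature_collection(features):
    """Wrap Features in a FeatureCollection, the usual top-level exchange object."""
    return {"type": "FeatureCollection", "features": list(features)}

collection = to_feature_collection([
    poi_to_feature("Repair Cafe", 48.2, 16.37, license="CC0-1.0"),
])
print(json.dumps(collection, indent=2))
```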

@almereyda @alabaeye is that a correct representation of the ETL conversation on Work Week day 1?
@gandhiano @species do you follow? Questions, suggestions?

#Three best practices for building successful data pipelines

Drawn from their experiences and my own, I’ve identified three key areas that are often overlooked in data pipelines, and those are making your analysis:

  • Reproducible
  • Consistent
  • Productionizable

While these areas alone cannot guarantee good data science, getting these three technical aspects of your data pipeline right helps ensure that your data and research results are both reliable and useful to an organization.

An aggregation Hub means centralisation, which @almereyda surely will confirm is not what we want.

An aggregation system which simply collects data raises a lot of problems:

  • Licensing: You cannot combine differently licensed data into one database:
    • If some datasets require attribution (e.g. CC-BY), every printed or otherwise generated map has to attribute each of the data sources. If there are more than, say, 5-10 of them, the license statement needs a lot of space on every map.
    • If some of the datasets are non-commercial only, you impose the NC requirement on all other datasets in the same database and are no longer allowed to embed a map in e.g. a standard wordpress site on
    • Sources that have the possibility for users to add data (e.g. mostly do not force the user to agree to a data license. Data that is generated by users without agreeing to a license (e.g. PD or ODbL) CANNOT be used (even if the project maintainer would agree!), because its copyright remains with the individual contributors.
      • Re-licensing of datasets to be compatible with TransforMap is only possible when EVERY contributor to the source datasets agrees, which is impossible when anonymous edits are allowed!
    • Share-Alike licenses: combining differently licensed data with SA licenses is legally NOT possible. Period.
  • Synchronisation issues:
    • Updating/changing of data has to happen in the source databases, or we need very complex locking mechanisms for synchronizing two writeable databases.
    • How to handle duplicate data?
      • What to do when two source datasets differ in attributes for the same POI: which one is “true”? How to prevent “ping-ponging” of attribute values when the sources are synchronized at different intervals?
      • How to update data that is present in more than one database?
      • What to do when one data point vanishes somewhere: has it been deleted because the POI no longer exists, or by error? How do we decide when to remove it from our DB?
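
To show where the duplicate problem starts, here is a small sketch that compares two records for the same POI and reports the attributes they disagree on. It assumes we already know both records describe the same POI (which is a hard problem in itself), and the attribute names are only examples. It does not decide which value is “true”; it only surfaces the conflict.

```python
def find_conflicts(record_a, record_b):
    """Return {attribute: (value_a, value_b)} for attributes both records
    contain but with different values. Attributes present in only one
    record are not conflicts and are ignored."""
    shared = record_a.keys() & record_b.keys()
    return {
        key: (record_a[key], record_b[key])
        for key in sorted(shared)
        if record_a[key] != record_b[key]
    }

source_a = {"name": "Food Coop", "opening_hours": "Mo-Fr 09:00-18:00", "phone": "123"}
source_b = {"name": "Food Coop", "opening_hours": "Mo-Sa 09:00-18:00", "wheelchair": "yes"}

print(find_conflicts(source_a, source_b))
```

If an aggregator silently overwrote `opening_hours` with whichever source synced last, the value would flip back and forth with every sync, which is exactly the ping-pong effect described above.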

Therefore, I vote for not building a centralized hub that aggregates data, but for staying decentralized, so that different attributes of data end up in different databases: storing each type of data where it belongs (so no content needs to be duplicated!) and improving the inter-linking between the databases.

I would add a 4th possibility: Convincing all our partners to store their data in the respective (already existing!) databases, that are established worldwide for the type of data:

  • Wikidata (or DBpedia?) for interlinking and translations
  • OpenStreetMap for geodata
  • Wikimedia Commons for media content
  • If a partner has data that doesn’t fit in any of the established DBs, he is invited to open (with our help, if needed) a new “Commons” database for this particular datatype!
  • If someone has “private” data, he can always host that himself and link from it to the other DBs.

This is the scheme I suggested in the SSEDAS thread too.

Now, go on punishing me for destroying the idea of a centralized data hub - but please hit me with better ideas :wink:

I wonder how we should dissect that topic, and I think it should get its own category.

There are the following topics:

I actually doubt that hypothesis, as my very simple understanding is that data is simply flagged/marked with the respective license and can be filtered by license for further work.
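
As a rough sketch of what I mean by flagging and filtering (the `license` field and the license identifiers are just assumed examples, and this is only the technical side, not a statement about what is legally allowed):

```python
def filter_by_license(records, allowed):
    """Keep only records whose 'license' flag is in the allowed set.

    Records without any license flag are dropped, since nothing can be
    assumed about how they may be used.
    """
    allowed = set(allowed)
    return [record for record in records if record.get("license") in allowed]

records = [
    {"name": "Repair Cafe", "license": "CC0-1.0"},
    {"name": "Food Coop", "license": "CC-BY-NC-4.0"},
    {"name": "Community Garden", "license": "ODbL-1.0"},
    {"name": "Swap Shop"},  # no license flag at all
]

commercial_ok = filter_by_license(records, {"CC0-1.0", "ODbL-1.0"})
print([record["name"] for record in commercial_ok])
```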

Form of centralized / decentralized / multicentric databases:

Partnering from where we are with existing community-based bodies and infrastructure

My honest opinion is that we are too early for that, and first have to prove that we are a partner worth cooperating with, so that the other side, with more experience, would make the adaptations in interfaces, workflows and regulations that we might need.

  • The “collaboration” with the OSM community (apart from @species, whose contribution is amazing) was quite meager.
  • Wikidata might be a good solution, but at the same time might not. Let's find out the risks, opportunities, threats and possible solutions?
  • I doubt that Wikimedia Commons sees it as its purpose to have hundreds of images of initiatives stored there. Who would like to investigate?

I also (from my limited understanding) think that real-time synchronization issues might emerge that we currently can't even think of.
Personally, when evaluating the risks, I see:

  • storage synchronization (data copies in different places, where we have full governance over the body of data we need for the live production to the user interface)
    • risk of storage synchronization problems: high, especially in the beginning
    • chance of a fatal outcome: very low, as the live delivery of data comes from one source and does not need synchronization
  • live process synchronization (if we decentralize into several databases of several governing bodies right from the beginning)
    • risk for the live production process: high, especially as we depend on infrastructure outside our governance
    • chance of a fatal outcome: very high, as the live delivery of data to the user interface comes from various sources, without governance synchronization
    • problems of time lag in the production process: a possible issue to take into consideration

A post was split to a new topic: inquirey: Host images in Wikimedia commons?

@species I will get myself out of that discussion now.
@almereyda once stated that I should not be involved so much in technical discussions.

As I do not have any competency on technical decisions apart from my everyday beliefs, I think it is better to simply be an observing bystander at this point.

If you have specific requests to involve me, I am very happy to contribute my 5 cents.

The sentence “cannot combine” is a summary of the points explained below.

By combining different attributes of data from different databases into one table, you are creating a “derivative work” - think of stitching several different photos together and saving them into one JPEG.

Surely, it is technically possible to cut them apart again automatically, when someone is making queries to the database and can get the data delivered in fine slices (each tagged with the respective license).

But: the database itself would be in a legally unknown (and often impossible) state. You cannot offer the whole database for download, which contradicts our dedication to Open Data!