Geodata storage evaluation

The poll shows a clear tendency for tomorrow morning.

So let’s meet on Thursday, Feb 4th 11:00 CET in Mumble.

Thanks to all of you who voted in the poll. Of course, anyone interested can join too :slight_smile:.

See you tomorrow!

2 Likes

I’ve set up a hackpad for this meeting:
https://notes.transformap.co/TransforMap-Database-Requirement-Analysis-Meeting-54WFdDx9tXl

In today’s Mumble meeting about criteria for geodatabases, we had the following insights:

  • “Master” database (for TransforMap POIs) and “cache” (ETL Hub) have different requirements.
  • We should probably split “caching” and “master” databases.
  • Requirements can be separated into “for the API” and the “pure database”.

So the requirements agreed on are:

Both systems:

  • open source
  • active community behind it
  • good reputation, good security (difficult to hack)
  • UTF-8 support
  • should allow key-value storage of POI attributes, no fixed columns
  • accessible via HTTPS (encryption)
  • must support geo-queries (bounding-box/nearby searches)
  • (this is only needed for the POI DB if we don’t have the cache from the beginning)
  • HTTP interface that returns JSON (a request sketch follows this list)
  • database ports must not be exposed directly to the open Internet
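
To make the HTTP/JSON and geo-query items above concrete, here is a minimal client-side sketch; the host name, path and `bbox` parameter are hypothetical placeholders, not an agreed API:

```js
// Hypothetical endpoint — illustration only, not a decided API.
const bbox = [16.2, 48.1, 16.5, 48.3]; // west, south, east, north
const url = `https://api.example.transformap.co/pois?bbox=${bbox.join(',')}`;

fetch(url)                         // plain HTTPS, no database port exposed
  .then(res => res.json())         // the service answers with JSON (e.g. GeoJSON)
  .then(collection => {
    for (const feature of collection.features) {
      console.log(feature.properties); // free-form key-value POI attributes
    }
  });
```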

For the “master” POI storage of TransforMap:

  • history for all objects (possibly already provided as a built-in feature), with linking to previous versions possible
  • must provide a permanent unique identifier (UUID) for each object → static links that always lead to the most recent version
  • possibility to look up previous versions (see the document sketch after this list)
  • replication capability
  • ability to get data that is less than one minute old (nice to have)
  • diffs on files would be a performance gain (nice to have)
  • possibility to use either live data or cached data
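
To illustrate the identifier and history requirements above, a stored object could look roughly like the following; the field names and URL pattern are assumptions for illustration, not a decided schema:

```js
// Hypothetical document shape: permanent UUID plus links between versions.
const poi = {
  uuid: '2a9f1c4e-0d3b-4f6a-9c1e-5b7d8e2f1a3c',  // permanent identifier, stable across edits
  version: 3,                                     // current revision number
  previous: '/pois/2a9f1c4e-0d3b-4f6a-9c1e-5b7d8e2f1a3c?version=2', // look up older versions
  properties: {                                   // free-form key-value attributes, no fixed columns
    name: 'Repair Café Example',
    category: 'repair'
  },
  geometry: { type: 'Point', coordinates: [16.37, 48.21] } // lon, lat
};
```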

For the “caching” database:

This is the Extract-Transform-Load (ETL) hub from CHEST.

  • modularity: design concept to easily integrate more data sources
  • performance should be able to scale up with size
  • users need to see how out of date the cache is
  • write-back capability: no, it only fetches

API

All other requirements were API requirements. They were collected, but not discussed further because of time constraints; they will probably be discussed in another meeting.

  • User accounting for individual users
  • OAuth 2 / OpenID Connect for users
  • able to make SPARQL queries (e.g. to Wikidata)
  • send requests to update/refresh single items from the source database
  • this would only work for some datasets (TM DB, OSM, Wikidata, …) that have an API
  • “expiring” of data

Nice to have:

  • web interface
  • for looking up individual datasets and their history
  • user management via web
  • integrated (web) editor

The next step is to make a table with these requirements as columns, and to add the different database solutions as rows. @gandhiano had some tools for this a while ago.

The decision on the final database(s) will be made at the beginning of the Witzenhausen Hackathon, probably Monday/Tuesday. See the poll here on Framadate.

1 Like

@maxlath, @gandhiano, @species, @almereyda (did I miss anyone?), good to talk with you all today. During the meeting, I was trying to remember an article I’d read last year. Finally I did: Who’s On First. I think this is an interesting read. A database of POIs and a gazetteer have a lot in common. There’s a discussion here in another thread.

On a separate issue: Is there anywhere I can read about the “big picture” design that includes the databases whose requirements were discussed today?

2 Likes

First, we have to acknowledge that there is more discussion about future options than about extending existing experiments, with the iD fork, the demo maps and the Semantic MediaWiki being the only self-hosted geo services right now.

Second, working purely in the abstract, imaginative layer of specifying our needs, we are prone to the trap of confusing an implementation with the function requested from it.

Looking at how we have been overengineering until now, we can rest assured that OAuth and SPARQL, even with geo extensions, are out of scope for the next half year.
Now we have to clear our minds again to distill further requirements regarding:

  • query capabilities and accessibility
  • storage and data models
  • failover and recovery scenarios

We understand that in a world of service-oriented architecture and connected microservices, a lot of experience in distributed systems design is not directly available to us. We are struggling to tell graph, column and row stores apart, not to mention external indexing services and multi-model or only partly open source databases.

When it comes to querying, we see the predominant *QL dialects, but also semi-proprietary approaches and far-distant futures with Linked Data Fragments. Yet we want to encourage users to self-host and only publish what they want, so we are agnostic about how users offer us their data. Still, we have to make a decision for ourselves.

A custom web service may suffice for now and could be extended to offer further dialects. In the beginning we are probably not even getting the linked data part right, as we never discussed any JSON-LD contexts for the models needed. But as we probably just want to store JSON documents (files!) anywhere (git with a web interface, a web server, document stores, …), we can imagine a progressive schema which adapts to our needs.
The more API consumers exist, the more careful we have to be about breaking changes.

Replication and sharding will then ensure that most of the data is available most of the time, scaling requests and storage horizontally between nodes. Thinking about a data federation reminds us that many questions in this field, around data privacy, authority, ownership and the like, are not yet completely answered.

Yet there are examples that point a way forward:

  • Merkle-DAGs (like git, dat, ipfs, forkdb) provide the ease of versioning we need
  • Remote forking and pull requests are nowhere standardized across implementations, but Federated Wiki for the former and (private) Webmention for the latter show how they could work one day.
  • Event-based, social architectures of streaming data make assumptions about how to solve identity; SoLiD and Activity Streams 2.0 are examples.

To sum up:

There are many ways of storing data, but our anticipated query scenarios constrain the perspectives of an evaluation grid and filter out over-scaled or under-supported environments.

The focus of TransforMap should still be to engage communities in discussions about economic decentralization, linked data models and reference implementations of different approaches to linking data. Thus we care more about the vocabularies to be researched and possible counterparts in a testbed than about inscribing a global normalisation onto the field of federated civic geodata publishers.

I still suppose there will be many different data stores to build upon. A geo index in a replicated database (CouchDB + GeoCouch) may be interesting, but just as interesting are a properly versioned store that is easily distributable to end users (dat) and a social graph of the individuals and organizations working with us. Also, the more events we process and the closer we get to publishing streams, the more important performance and intermediate caching become.

Let’s just assume any geodatabase and not specify it. What is important is how we access it and make its data available to the public. This is most probably going to be a custom API in front of it.

A big-picture design is what we are constantly lacking, but that approach is probably doomed to fail right from the beginning: do we start by imagining the perfect solution, or do we collect existing building blocks and go only one step further at a time, accepting that possibly colliding development futures may either collapse into one or differentiate into many? Given our restricted resources, we are working on many layers, not only technical ones, at the same time, and thus produce the image of stalled progress.

Please review the complexity yourself by digging into

I am currently reviewing our Taiga project for associated user stories. There are four predominant layers in the current anticipation of what we need. I will go through each of them and put them into context with the latest updates available to me.

Public/Private API + geo-aware backend

@toka and I pretty much agree that this means a simple Node.js daemon which forwards bbox and regular queries to a GeoCouch, but adds the thin authentication layer we need (a rough sketch follows the list below). Helpful resources could be:

  • Hoodie, which provides basic libraries like .account, .store and .email and could be made geo-aware by using its PouchDB integration for a client-side geo index in a Progressive Web App, or a GeoCouch extension for the web service. The associated .geo library would work on both client and server.
  • This REST API template from a thinkfarm friend.
  • How to implement different versioning strategies with CouchDB
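
A rough sketch of that daemon, under several assumptions: Express and node-fetch as libraries, a simple token check standing in for the real authentication layer, and placeholder database/design-document names (GeoCouch’s actual spatial view paths depend on how the design document is set up):

```js
// Minimal sketch, not the agreed implementation: an Express daemon that
// performs a thin auth check and forwards bbox queries to GeoCouch.
const express = require('express');
const fetch = require('node-fetch');

const app = express();
const COUCH = 'http://localhost:5984'; // CouchDB itself is never exposed publicly

app.get('/pois', async (req, res) => {
  if (req.query.token !== process.env.API_TOKEN) {    // placeholder auth check
    return res.status(401).json({ error: 'unauthorized' });
  }
  // "pois", "geo" and "points" are hypothetical database/design-doc/view names.
  const upstream = `${COUCH}/pois/_design/geo/_spatial/points?bbox=${req.query.bbox}`;
  const couchRes = await fetch(upstream);
  res.json(await couchRes.json());                     // hand the JSON straight through
});

app.listen(8080);
```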

Map Editor

There is common consensus that uMap is the closest thing to the editor we could long for. Also, we can neither build a complete WebGIS from the bottom up nor design it top-down; thus we SHOULD build on the strengths of open collaboration and COULD declare uMap our reference implementation of a collaborative web mapping application.

We would tear it apart from the middle out and extract a usable Leaflet.Storage, abstracted enough for use within multiple mapping applications and linking different frontends to different backends: the missing thin waist of current WebGIS offerings and another step towards a more geo-aware web in general.
We could finally be approaching a geo transport layer that increases interoperability between multiple implementations.

What uMap currently lacks to be enough for us is:

  • an official Dockerfile for increased self-hostability.
  • an API to expose public and private data
  • a taxonomy viewer, templates + editing features, but these are maybe just another coupled service; Federated Wiki comes to mind. An explanation of how this could look is available upon request.
  • a simple spreadsheet data collection interface: the view already exists, it just needs to be directly linkable in the JS frontend and could use tighter integration with a free geocoding service à la Mapzen’s (see the sketch after this list).
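
As a sketch of what that geocoding integration could look like, assuming a Pelias-style search endpoint (the kind Mapzen Search exposes); the host, address and API key are placeholders:

```js
// Hypothetical geocoder call for one spreadsheet row — host and key are placeholders.
const query = encodeURIComponent('Marktgasse 1, Witzenhausen');
const url = `https://geocoder.example.org/v1/search?text=${query}&api_key=DEMO-KEY`;

fetch(url)
  .then(res => res.json())
  .then(geojson => {
    // Pelias-style responses are GeoJSON FeatureCollections.
    const [lon, lat] = geojson.features[0].geometry.coordinates;
    console.log(lat, lon); // fill the row's coordinate columns
  });
```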

Most of these features would be offered by CartoDB instead.

Map of Maps

Since one of the Scrum stand-ups we have known that @species managed to harvest the Semantic MediaWiki and create an overview map of the mappings. Unfortunately we don’t yet know how it came into existence, as it is completely undocumented.

The existing work on importing and transforming tabular spreadsheet data could then be integrated into more generalized, yet not necessarily automated, workflow examples. Once we have storage in place, there are plenty of ways to interact with it from the terminal, from client-side JavaScript applications or from any other JSON-via-HTTPS consumer.

Map Mashups

So we come to an end. By building on high-quality open source software and integrating into existing ecosystems, we build on the work of thousands before us and only add minimal layers of patches that add our desired functionality. We even managed to make sure users keep control over their data and can revoke any publication permission at any time. If we federated the dataset before and it had been licensed accordingly, we probably still have other copies floating around. Now we want to show our multitude of views on alternative economies in as many places as possible.

From uMap and other websites we already know <iframe /> embeds, but Discourse shows us how lovely oneboxing is.

How does it work? It extracts structured data from the web (click on the first undefined type error and check out what is probably the world’s first use of the ESSGlobal vocabulary, by the Institute for Solidarity Economics) and displays it accordingly for known vocabularies.

As we also know, a large share of the world’s websites is powered by WordPress, so we could use it to distribute our self-hosted vision of location-based sustainability data and produce a small mapping plugin, which @species has already started creating. But if we think of loosely coupling different web services, we can also imagine creating a [shortcode] plugin directed at uMap. Inserting a mapping viewer or editor could become as simple as pasting a URL in Discourse, too!


What else is there to discuss?

1 Like

@maxlath,

could you describe what your decision to use CouchDB for your project inventaire.io was based on?
Did you also consider PostgreSQL, MongoDB or any other option in the list previously described by @almereyda: https://tree.taiga.io/project/transformap/task/206 ?


Following some of the linked data discussions, IMHO everyone agrees on that. It would be good to develop something that does not block the development towards linked data and is a first approach, but it will not be the final iteration towards it.


Last Friday, 4 people invested more than 2 hours to collectively elaborate ideas about a data storage, or, as I understood it, several data storages with different applications.

When it comes to the database to store our POIs, as well as the different taxonomies that should merge with each other, can we assume that @almereyda would favour CouchDB & CartoDB, as these are the two you mentioned in your elaborate and in-depth post?


To the other programmers @maxlath @pmackay @mattw @species: could you give feedback on the descriptions by @almereyda, who unfortunately could not make it to the call last week?

hi!
My choice of CouchDB for inventaire was very much determined by my learning path, as explained in this issue, and, unfortunately, I have very limited experience with other databases. What I can say is that the combination of CouchDB for master data and LevelDB (and its many extensions) for cache/secondary indexes has worked quite well for me so far and is very pleasant to work with. I just outsourced proximity queries to LevelDB thanks to the level-geospatial plugin, and could probably do some graph queries later thanks to the levelgraph plugin, or even go JSON-LD thanks to levelgraph-jsonld. (For more on this modular database approach, I would recommend Rod Vagg’s talk Introducing LevelDB.)
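
To illustrate the “LevelDB as cache/secondary index” idea, here is a hand-rolled sketch using the `level` and `ngeohash` packages; it is a simplified stand-in for what level-geospatial does (which also handles neighbouring cells properly), and the key layout and package versions are my own assumptions:

```js
// Sketch of a LevelDB-backed secondary geo index kept next to the master data.
const level = require('level');
const geohash = require('ngeohash');

const db = level('./geo-cache');

// Index a POI under a geohash-prefixed key; the value stays a JSON document.
function indexPoi(poi) {
  const hash = geohash.encode(poi.lat, poi.lon, 9); // ~5 m cells at precision 9
  return db.put(`geo!${hash}!${poi.id}`, JSON.stringify(poi));
}

// Crude proximity lookup: scan all keys sharing a geohash prefix.
function nearby(lat, lon, precision = 6) {
  const prefix = `geo!${geohash.encode(lat, lon, precision)}`;
  return new Promise((resolve, reject) => {
    const results = [];
    db.createReadStream({ gte: prefix, lt: prefix + '\xff' })
      .on('data', ({ value }) => results.push(JSON.parse(value)))
      .on('error', reject)
      .on('end', () => resolve(results));
  });
}
```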

My sole frustration with CouchDB is the plugin system: I gave up on GeoCouch as it was such a pain to install properly, and I wouldn’t wish that on all contributors/self-hosters, but that’s where @almereyda’s love for containers could totally make sense :wink:

@maxlath
thank you for your description, especially the account of your journey in the GitHub link! The Spotify link is interesting; it seems to definitely make PostgreSQL a no-go.

what was the pain, and did it work in the end? Or why not?


@almereyda That might be an interesting conversation in Witzenhausen.
Or CartoDB?

@maxlath, how do you see your possibilities of contributing to a database system other than CouchDB, e.g. CartoDB?

The pain was that it isn’t very well integrated: you can’t just apt-get install geocouch, or do what I saw in the Elasticsearch world, where things can be as simple as elasticsearch/bin/plugin install mobz/elasticsearch-head. You have to build CouchDB from source with the right version, as GeoCouch doesn’t work with all versions, and copy the GeoCouch files into some directory, with little guarantee that you are doing the right thing until you can see the final result.

I don’t know CartoDB but if other contributors know how to use it, I guess I could learn :slight_smile:

Which link? If you talk about a link, please include/link it! I couldn’t find any reference to Spotify in @maxlath’s links.

The article mentioned in the issue:
Switching user database on a running system - labs.spotify.com

Please be aware CartoDB is a web mapping platform that uses PostgreSQL with the PostGIS extension. It is nowhere near being a database itself.

I would strongly oppose that. We are not in Spotify’s situation with 75M users, especially users that are extremely sensitive to outages (as they are paying for the service). OSM has 2.3M users, and its DB configuration is very similar to Spotify’s: one write DB, many read DBs. If there is an outage of the write DB, then you can’t edit, that’s true, but all other services remain unaffected. Guaranteeing 100% edit availability is a problem that is so far away in the future that, in our situation with very limited funds, I would not care about it at the moment.

So basically CartoDB is an application that integrates with PostgreSQL as its database?

In very simple terms, that would mean CartoDB = (building on) PostgreSQL; or would there also be other uses of PostgreSQL that would not involve CartoDB?


@almereyda,
what do you think about the PostgreSQL write/read setup?


Personally, from an absolute non-technician’s standpoint, I share:

First of all, as much as I appreciate your insights, it would have been very helpful if you had attended the meeting, especially as you signed up in the Framadate. /rant off.

If we have to implement them ourselves, then I agree. If a solution provides it out-of-the-box, we can take it now.

could you specify this in detail?

Good point - I will add it to the requirement criteria matrix.

+1 for step-by-step.


Thanks for the reminder, I’ve documented it in this thread.


I’ve started the requirement matrix here in this ethercalc (Just accept the expired certificate).

Most high-end FOSS GIS solutions build upon Postgres/PostGIS; for example, the OpenStreetMap stack (the so-called “Rails port”, which I would suggest as the POI DB) uses PostGIS both for the main DB and for the rendering servers.

I was one of those people, but I am uneasy about the value (and the interpretation here) of my contribution, because I do not properly understand the use to which this database will be put. I seem to have entered the discussion at a point where all that needs to be done is to refine the implementation details, and where the “higher”-level questions about use cases and //why// we are doing what we are doing have already been answered.

For example, in this thread (or links from it), I have found the following:

@species in https://tree.taiga.io/project/transformap/us/71

“Scales for at least 100 million POIs”

This sounds like a big centralized database. I may be wrong.

@gandhiano in https://tree.taiga.io/project/transformap/task/206:

“For me having a decision on the geodatabase to use is necessary until the next hackathon begins on Feb 8th. It is a precondition for allowing a fruitful work with maxlath and to allow us to move forward in shaping and implementing the TM stack.”

The two comments above can hardly be describing the same instance of a geodatabase, as the second requires a decision by Feb 8th.

In the absence of proper understanding, I have come up with some hypotheses:

  • Guess 1: Perhaps my lack of understanding is down to the fact that I am quite new here. If I keep digging through discourse/taiga/mediawiki/1Xmmm/github issues, then I will find the answers.

  • Guess 2: There are two paths being followed: The first is the long term view as expressed by @almereyda. The second is to extend the OSM demo further, partly to work around the limitations of OSM. If this guess is correct, then it would be really useful to be explicit about which path is being followed when the two paths cause conflicting conclusions.

  • Guess 3: We need to get experience of geodatabases in order to realize our long term goals, so let’s start playing with them now in order to get that experience. This will eventually feed back into the long-term vision.

Are any/all of the guesses right?

Is there an external force (e.g. commercial/funding/deliverables etc) that is driving this decision? If so, it would be really useful to acknowledge it because these things usually have an impact on the technology under construction, and it gets really confusing if the existence of such external forces is not made explicit.

On the subject of funding, my participation here is funded by the Institute for Solidarity Economics. One of our goals is to empower initiatives within the Social and Solidarity Economy to get themselves “on the map” (in the widest sense) by providing data about themselves. We aim to do so by eventually providing tools and reference implementations that can drive the creation of distributed Linked Open Data. We do not want to create a centralized database. We want to use the standards and recommendations described in Linked Data: Evolving the Web into a Global Data Space. We will use (and adapt where necessary) the application profile described by ESSglobal. We hope that in time geographic mapping is but one example of many applications that use this open data. We have a lot of work to do in order to meet our goal, starting with a more detailed look at ESSglobal and experimenting with publishing linked data on the web.

So, our goals are closely aligned with the view expressed here:

1 Like

Hi @mattw,

sorry for the long delay, the Witzenhausen Hackathon was quite intensive, and we all needed a little rest.

Each of the “forces” in the TransforMap community has slightly different opinions about “our” database: what it should be used for and what should be stored in it. There was only one common requirement: “we need a database”. The primary use would be to store geodata, but we also need a graph database where we can store our (and others’) taxonomies and interlink them.
The meeting you attended (whose goal was to define requirements for a DB) brought us a big step forward in aligning our requirements inside the TransforMap community.

Because “what TransforMap is” is such a big topic, there are several things TransforMap is expected to provide:

  • Provide a Point of Interest (or “places”, as we called them during the hackathon) storage for communities that don’t have, or don’t want to develop and maintain, a mapping system themselves.
  • Provide an “extra tag storage” where one can annotate POIs stored elsewhere with one’s own tags (this is one of my personal requirements inside TransforMap) - see my proposal here.
  • Provide a big “cache” where the POI data of other sites and mapping systems is cached, as a backup and for performance reasons. This “cache”, or ETL (Extract-Transform-Load) hub as @almereyda calls it, should also provide a common API for different data sources.
  • Create a machine-readable and interlinked taxonomy system, where different ontologies are stored and linked together (this system was not discussed in the last meeting).
  • And surely something else I forgot that others expect from TransforMap :wink:

As I am used to working with worldwide datasets at this scale (OpenStreetMap), I only wanted to include future-proof, scalable solutions here. No slight against decentralisation intended.

I would say it is rather the lacking documentation of our two-year-long process that makes it difficult to follow :wink:

Actually, these two should merge in the long term. The approach of the demo maps is currently to use OSM as the primary data source (though they are theoretically able to query our recently set up new API too).

We need at least some playground for our API to work on, that is also true.


Yes, there actually are some deadlines: some from the CHEST project, and some from the SUSY (or SSEDAS) project, in which some members of TransforMap are active, providing the mapping part for SSEDAS with TransforMap technology.

Especially for SSEDAS we need to provide POI storage, because not every one of the 26 partners involved has, or can manage, their own database of POIs.

2 Likes

During

I had also been made aware of the possibility of using Wikibase as a geodata store, through a geo visualisation of its query interface.