Geodata storage evaluation


(Josef Kreitmayer) #1

@gandhiano, did you already start developing the database evaluation grid for TransforMap?
@species agreed to make a first draft.

@maxlath @mattw @pmackay your input is also highly welcome.

Here is the task started by @almereyda: https://tree.taiga.io/project/transformap/task/206

There are probably some ideas in there; please bear in mind that I have no expertise here myself and just searched for “database evaluation matrix”.


(Michael Maier) #2

I think we should start evaluating the requirements for a geodatabase for TransforMap:

  • Open Source,
  • UTF-8
  • should allow key-value storage of POI attributes, no fixed columns
  • history for all objects (with linking to previous versions possible)
  • accessible via https
  • has to provide unique identifiers for each object -> static links that lead to the most recent version
  • has to allow geo-queries (bounding-box/nearby searches)
  • HTTP interface that returns JSON (see the sketch below)
  • user accounting for individual users
  • oauth/2 for users
  • RDF capable (?)

nice to have:

  • webinterface
  • for looking up individual datasets and their history
  • user management via web
  • integrated (web) editor
  • possibility to sync (diffs) to mirrors
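
To make the geo-query and HTTP/JSON requirements a bit more concrete, here is a minimal sketch of what a client query could look like. The endpoint path, parameters and response shape are purely hypothetical, not an existing API:

```typescript
// Hypothetical bounding-box query against the geodatabase's HTTP interface.
// Endpoint, parameters and response shape are assumptions for illustration only.
interface Poi {
  id: string;                    // unique identifier (requirement above)
  version: number;               // object history / versioning
  lat: number;
  lon: number;
  tags: Record<string, string>;  // free key-value attributes, no fixed columns
}

async function queryBbox(base: string, bbox: [number, number, number, number]): Promise<Poi[]> {
  // bbox order assumed as minLon,minLat,maxLon,maxLat
  const res = await fetch(`${base}/pois?bbox=${bbox.join(",")}`, {
    headers: { Accept: "application/json" },
  });
  if (!res.ok) throw new Error(`geo query failed: ${res.status}`);
  return res.json(); // requirement: HTTP interface returning JSON
}

// Example: all POIs in a small box around Vienna
queryBbox("https://example.org/api", [16.3, 48.1, 16.5, 48.3]).then(console.log);
```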

Suggestions? Other requirements? @gandhiano, @almereyda, @maxlath ?


(Matt Wallis) #3

I’m not an expert on databases, but that won’t stop me offering some views :smile:

I should also declare a bias towards Linked Data / Semantic Web. I’m not an expert on these either, but I am convinced (until proven otherwise) that these offer the best route for open data applications on the web. Of course that doesn’t mean that one has to use RDF stores, but it did lead me to the following well presented research into geospatial RDF stores:

The report is relevant not only to RDF stores: it gives a good description of some general issues surrounding geospatially-enabled databases, provides a general introduction to the Semantic Web and GeoSPARQL, compares the RDF stores against the Oracle Spatial and PostGIS DBMSs, and offers plenty of food for thought (test cases) when deciding how to evaluate geospatially-enabled databases.

My own experience with (non-RDF store) databases would lead me to recommend PostgreSQL. It was recommended to me by someone with decades of database experience, and I have always found it excellent. The report points out that the PostGIS extension to PostgreSQL supports conversions between different projections via the proj4 library. Might this re-projection capability also be a requirement, for example to allow input of location information presented in other projections? PostgreSQL also provides hstore for key/value pairs.
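
To illustrate those two PostgreSQL features, here is a rough sketch using the node-postgres client; the table and column names are made up, and it assumes the PostGIS and hstore extensions are installed:

```typescript
// Sketch only: assumes a table pois(geom geometry, tags hstore); names are illustrative.
import { Client } from "pg";

async function poisNearby(lon: number, lat: number) {
  const client = new Client({ connectionString: process.env.DATABASE_URL });
  await client.connect();
  const { rows } = await client.query(
    `SELECT tags -> 'name'          AS name,            -- hstore key/value lookup
            tags -> 'opening_hours' AS opening_hours,
            ST_AsGeoJSON(ST_Transform(geom, 4326)) AS geojson  -- re-projection (proj4 under the hood)
       FROM pois
      WHERE ST_DWithin(ST_Transform(geom, 4326)::geography,
                       ST_SetSRID(ST_MakePoint($1, $2), 4326)::geography,
                       1000)  -- within 1 km`,
    [lon, lat]
  );
  await client.end();
  return rows;
}
```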

@species, nice list! I’d find it helpful to also see why these requirements were in the list. Perhaps it is just my unfamiliarity with the project that means that they are not all obvious.

On a more general note, could someone explain to me where this geodatabase sits in the TransforMap architecture? Is it a ‘master’ or a cache?


(Josef Kreitmayer) #4

@mattw,

PostgreSQL was also mentioned by @almereyda as one of the options: https://tree.taiga.io/project/transformap/task/206

I just read a review that favours PostgreSQL over MongoDB, where I find the comments especially interesting: http://www.aptuz.com/blog/is-postgres-nosql-database-better-than-mongodb/

One doubt with MongoDB, and I suppose also CouchDB, is the query functionality (I do not fully understand what I am writing here, I am just citing what people see as advantages of PostgreSQL).

In the link above:


(Josef Kreitmayer) #5

Interestingly enough, comparing the look and feel of the MongoDB communication with that of PostgreSQL feels like comparing WordPress to TYPO3.

appearance: https://www.mongodb.org/
community: http://www.meetup.com/de-DE/pro/mongodb/

appearance: https://wiki.postgresql.org/
community: http://2015.pgconf.de/

The first is widely used, but was initially intended for something very basic and has grown rapidly through related applications (speaking of WordPress); the latter is a very well-equipped system, but much less widely adopted, with quite a small set of people able to customize it : )


(Paul Mackay) #6

@species, reading your initial post, some of those requirements seem to cover more than just a database. Is that how you see it? Are you thinking of a database-backed system that handles the storage but also other functions?


(Michael Maier) #7

@pmackay the decision about a specific database will be heavily influenced by the existing ecosystem around it.

We have some requirements that need to be handled in TransforMap anyway, such as user management. If a database solution offers user handling with a web interface/OAuth, that would save us from implementing it ourselves.

So I can answer yes, I don’t see it as a pure backend question.


(Michael Maier) #8

Thanks @mattw for your insights!

I also see this as the way to go.

I also would favor this solution, yes.

But I think we should agree on requirements first, before evaluating solutions.

Mainly I’ve looked at best practices from existing projects.

Some are simply needed for map services (geo-queries, http interface, user accounting), some are just standard requirements for the way things are done today (UTF-8, versioning, static links).

But I do want to discuss each requirement :slight_smile:!

I would see it as a ‘master’ for POIs that cannot be (fully) stored in existing commons infrastructure (OSM, Wikidata), and also for storing additional information that doesn’t fit into OSM. But that’s my personal view; others will have different requirements!


(Michael Maier) #9

In today’s Scrum standup meeting, we thought it might be a good idea to schedule a meeting for the requirement analysis for the TransforMap database.

Please add your availability here: https://framadate.org/KPXK44qk2eHco2k6

Calling @almereyda, @maxlath, @mattw, @pmackay, @toka, @gandhiano especially :slight_smile:
Any others interested, feel free to join!


(Michael Maier) #10

The poll shows a clear tendency towards tomorrow morning.

So let’s meet on Thursday, Feb 4th 11:00 CET in Mumble.

Thanks to the voters for your participation in the poll. Of course anyone interested can join too :slight_smile:.

See you tomorrow!


(Michael Maier) #11

I’ve set up a hackpad for this meeting:
https://notes.transformap.co/TransforMap-Database-Requirement-Analysis-Meeting-54WFdDx9tXl


(Michael Maier) #12

In today’s Mumble meeting about criteria for geodatabases, we had the following insights:

  • “Master” database (for TransforMap POIs) and “cache” (ETL Hub) have different requirements.
  • We should probably split “caching” and “master” databases.
  • Requirements can be separated into “for the API” and the “pure database”.

So the requirements agreed on are:

Both systems:

  • Open Source,
  • active community behind it
  • good reputation, good security (difficult to hack)
  • UTF-8
  • should allow key-value storage of POI attributes, no fixed columns
  • accessible via https (encryption)
  • has to allow geo-queries (Bounding-Box/Nearby searches)
  • (geo-queries are only needed in the POI DB if we don’t have the cache from the beginning)
  • http-interface that returns JSON
  • database ports must not be exposed directly to the open Internet

For “master” POI storage of TransforMap

  • history for all objects (possibly already provided as a built-in feature), with linking to previous versions possible
  • has to provide permanent unique identifiers for each object (UUID) -> static links that lead to the most recent version (see the sketch after this list)
  • possibility to look up previous versions
  • replicating capability
  • ability to get data that is less than one minute old (nice to have)
  • diffs on files would be a performance gain (nice to have)
  • possibility to use either live data or cached data
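
As a rough illustration of the history/UUID requirements above (purely a sketch, not a commitment to any particular store or field naming), a versioned POI document could look roughly like this:

```typescript
// Hypothetical document shape for the "master" POI store; all field names are assumptions.
interface PoiVersion {
  uuid: string;             // permanent identifier, stable across edits
  version: number;          // increments with every change
  previous: string | null;  // link to the previous version, e.g. "<uuid>@<version - 1>"
  updatedAt: string;        // ISO timestamp, shows how fresh the data is
  lat: number;
  lon: number;
  tags: Record<string, string>; // free key-value attributes
}

// A static link like /pois/<uuid> would always resolve to the most recent version,
// while /pois/<uuid>/versions/2 would resolve to a fixed historical one.
const example: PoiVersion = {
  uuid: "123e4567-e89b-12d3-a456-426614174000",
  version: 3,
  previous: "123e4567-e89b-12d3-a456-426614174000@2",
  updatedAt: "2016-02-04T11:00:00Z",
  lat: 48.208,
  lon: 16.373,
  tags: { name: "Example food coop", opening_hours: "Mo-Fr 09:00-18:00" },
};
```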

For the “caching” database:

This was the Extract-Transform-Load-Hub (ETL) from CHEST.

  • modularity: design concept to easily integrate more data sources
  • performance should be able to scale up with size
  • users need to see how out of date the cache is
  • write-back capability: no, it only fetches

API

All other requirements were API requirements. They were collected, but not discussed further because of time constraints; they will probably be discussed in another meeting.

  • User accounting for individual users
  • oauth/2 / OpenID connect for users
  • Able to make SPARQL queries (e.g. to Wikidata)
  • Send request for updating/refreshing (single items) from the source database
  • would only work for some datasets (TM DB, OSM, Wikidata,…) that have an API
  • “expiring” of data

Nice to have:

  • webinterface
  • for looking up individual datasets and their history
  • user management via web
  • integrated (web) editor

The next step is to make a table with these requirements as columns, and to add the different database solutions as rows. @gandhiano had some tools for this a while ago.

The decision on the final database(s) will be made at the beginning of the Witzenhausen Hackathon, probably Monday/Tuesday. See the poll here on Framadate.


(Matt Wallis) #14

@maxlath, @gandhiano, @species, @almereyda (did I miss anyone?), good to talk with you people today. During the meeting I was trying to remember an article I’d read last year; finally I did: Who’s on first. I think it is an interesting read. A database of POIs and a gazetteer have a lot in common. There’s a discussion here in another thread.

On a separate issue: Is there anywhere I can read about the “big picture” design that includes the databases whose requirements were discussed today?


(Jon Richter) #15

First, we have to acknowledge that there is more discussion about future options than about extending existing experiments, with the iD fork, the demo maps and the Semantic MediaWiki being the only self-hosted geo services right now.

Second, working purely in the abstract, imaginative layer of specifying our needs, we are prone to the trap of mixing an implementation with the function requested from it.

Looking at how we have been overengineering until now, we can rest assured that OAuth and SPARQL, even with geo extensions, are out of scope for the next half year.
Now we have to clear our minds again to distill further requirements regarding:

  • Queryability and accessibility
  • Storage and Data models
  • Failover and Recovery scenarios

We understand that in a world of Service-Oriented-Architecture and connected microservices, a lot of experience in distributed systems design is not directly available to us. We are struggling with separating Graph, Column and Row stores, not to mention external indexing services and multi-model or only partly open source databases.

When it comes to Querying, we see the predominant *QL dialects, but also semi-proprietary approaches and far distant futures with Linked Data Fragments. Yet we want to encourage users to host themselves and only publish what they want. So we are agnostic about how users offer us their data. Still we have to make a decision for ourselves.

A custom web service may suffice for now and could be extended to offer further dialects. In the beginning we are probably not even getting the linked data part right, as we never discussed any JSON-LD contexts for the models needed. But as we probably just want to store JSON documents (files!) anywhere (git with webinterface, webserver, Document stores, …), we can imagine a progressive schema which adapts to our needs.
The more API consumers exist, the more we have to take care about breaking changes.

Replication and sharding will then assure most of the data is available most of the times, scaling requests and storage horizontally between nodes. Thinking of a data federation tells us many questions within this field of data privacy, authority, ownership and the likes are not yet answered completely.

Yet there are examples which tell us a way to go:

  • Merkle-DAGs (like git, dat, ipfs, forkdb) provide the ease of versioning we need (a toy sketch follows this list)
  • Remote Forking and Pull Requests are nowhere standardized between implementations, but federated wiki for the first and (private) webmention for the second show how it could work one day.
  • Event-based, social architectures of streaming data make assumptions about how to solve Identity. SoLiD and Activity Streams 2.0 are examples of such.
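
For readers unfamiliar with the term, here is a toy sketch of the Merkle-DAG idea behind git/dat/ipfs-style versioning: each version is addressed by a hash of its content plus its parent, so history links are tamper-evident. This is an illustration only, not the data model of any of those systems:

```typescript
// Toy content-addressed version chain; not git/dat/ipfs, just the underlying idea.
import { createHash } from "node:crypto";

interface Version {
  id: string;             // content address (hash of data + parent id)
  parent: string | null;  // link to the previous version
  data: string;           // e.g. a serialized POI document
}

function commit(data: string, parent: Version | null): Version {
  const id = createHash("sha256")
    .update(parent ? parent.id : "")
    .update(data)
    .digest("hex");
  return { id, parent: parent ? parent.id : null, data };
}

const v1 = commit(JSON.stringify({ name: "Repair café", lat: 48.2, lon: 16.37 }), null);
const v2 = commit(JSON.stringify({ name: "Repair café", lat: 48.21, lon: 16.37 }), v1);
console.log(v2.parent === v1.id); // true: every version points at its ancestor
```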

To sum up:

There are many ways of storing data, but our anticipated query scenarios constrain the perspectives of an evaluation grid and filter out overscaling or undersupported environments.

The focus of transformaps should still be to engage communities in the discussions about economic decentralization, linked data models and reference implementations of different approaches to linking data. Thus we care more about the vocabularies to be researched and possible counterparts in a testbed than about inscribing a global normalisation onto the field of federated civic geodata publishers.

I still suppose there will be many different data stores to build upon. A geo index in a replicated database (CouchDB + GeoCouch) may be interesting, but just as interesting is a properly versioned store that is easily distributable to end users (dat), or a social graph of the individuals and organizations working with us. Also, the more events we process and the closer we get to publishing streams, the more important performance and intermediate caching become.

Let’s just assume any geodatabase and not specify it. What is important is how we access it and make its data available to the public. This is most probably going to be a custom API in front of it.

The big-picture design is what we are constantly lacking, but this approach is probably doomed to fail right from the beginning: do we start by imagining the perfect solution, or do we collect existing building blocks and only go one step further at a time, accepting that possibly colliding development futures may either collapse into one or differentiate into multiple? Given our restricted resources, we are working on many layers, not only technical ones, at the same time, and thus produce the impression of stalled progress.

Please review the complexity yourself by digging into

I am currently reviewing our Taiga Project for associated user stories. There are four predominant layers in the current anticipation of what we need. I will go through each of them and put them into context with latest updates available to me.

Public/Private API + geo-aware backend

@toka and I are pretty much in line with each other that this means a simple Node.js daemon which forwards bbox and regular queries to a GeoCouch, but adds the thin authentication layer we need (a rough sketch follows the list below). Helpful resources could be:

  • Hoodie, which provides basic libraries like .account, .store and .email, and could be made geo-aware by using its PouchDB integration for a client-side geo index in a Progressive Web App, or a GeoCouch extension for the web service. The associated .geo library would work on both client and server.
  • This rest-api template from a thinkfarm friend.
  • How to implement different versioning strategies with CouchDB
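
A very rough sketch of that idea follows; the database name, design document, view and token check are placeholders for illustration, not a worked-out design:

```typescript
// Minimal sketch of the thin API daemon: check a token, forward a bbox query to a
// GeoCouch spatial view, return the JSON result. Names below are assumptions.
import express from "express";

const COUCH = process.env.COUCH_URL ?? "http://localhost:5984";
const app = express();

app.get("/pois", async (req, res) => {
  // placeholder for the "thin authentication layer"
  if (req.headers.authorization !== `Bearer ${process.env.API_TOKEN}`) {
    return res.status(401).json({ error: "unauthorized" });
  }
  const bbox = String(req.query.bbox ?? ""); // "minLon,minLat,maxLon,maxLat"
  const upstream = `${COUCH}/pois/_design/geo/_spatial/points?bbox=${encodeURIComponent(bbox)}`;
  const couchRes = await fetch(upstream);
  res.status(couchRes.status).json(await couchRes.json());
});

app.listen(3000, () => console.log("API listening on :3000"));
```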

Map Editor

There is common consensus that uMap is the closest thing to the editor we could imagine wanting. Also, we can neither build a complete WebGIS from the bottom up nor design it top-down; thus we SHOULD build on the strengths of open collaboration and COULD declare uMap our reference implementation of a collaborative web mapping application.

We tear it apart from the middle out and extract a usable Leaflet.Storage, abstracted enough for use within multiple mapping applications, linking different frontends to different backends: a missing thin waist of current WebGIS offerings and another step towards a more geo-aware web in general.
We could finally be approaching a geo transport layer which increases interoperability between multiple implementations.

What uMap currently lacks to be enough for us is:

  • an official Dockerfile for increased self-hostability.
  • an API to expose public and private data
  • taxonomy viewer, templates + editing features, but these are maybe just another coupled service. Federated Wiki comes to mind. An explanation of how this could look is available on request.
  • a simple spreadsheet data collection interface - the view exists already, it just needs to be directly linkable in the JS frontend, and could use tighter integration with a free geocoding service à la Mapzen’s.

Most of these features would be offered by CartoDB instead.

Map of Maps

Since one of the Scrum standups we know that @species managed to harvest the Semantic MediaWiki and create an overview map of the mappings. Unfortunately we don’t yet know how it came into existence, as it is completely undocumented.

The existing work on importing and transforming tabular spreadsheet data could then be integrated into more generalized, though not necessarily automated, workflow examples. Once we have storage in place, there are plenty of ways to interact with it from the terminal, from client-side JavaScript applications, or from any other JSON-via-HTTPS consumer.

Map Mashups

So we come to an end. By building on high quality open source software and integrating into existing ecosystems, we build on the work of thousands before us and only add minimal layers of patches which add our desired functionalities. We even managed to make sure users keep control over their data and can revoke any publication permissions at any time. If we federated the dataset before and it had been licenced accordingly, we probably still have other copies floating around. Now we want to show our multitude of views on alternative economies in as many places as possible.

From uMap and other websites we already know <iframe /> embeds, but Discourse already shows us how lovely oneboxing is.

How does it work? It extracts Structured Data from the Web (click on the first undefined type error and check out what is probably the world’s first use of the ESSGlobal vocabulary, by The Institute for Solidarity Economics) and displays it accordingly for known vocabularies.

As we also know that about half of the world’s websites are powered by WordPress, we could use it to distribute our self-hosted vision of location-based sustainability data and produce a small mapping plugin, which @species has already started creating. But if we think of loosely coupling different web services, we can also imagine creating a [shortcode] plugin directed at uMap. Inserting a mapping viewer or editor could become as simple as pasting a URL in Discourse, too!


What else is there to discuss?


  • TransforMap.co landing page
  • evaluation of editors from projects in the mapping of mappings collection
(Josef Kreitmayer) #16

@maxlath,

could you describe what your decision to use CouchDB for your project inventaire.io was based on?
Did you also consider PostgreSQL and MongoDB, or any other from the list previously described by @almereyda: https://tree.taiga.io/project/transformap/task/206 ?


(Josef Kreitmayer) #17

Following some of the linked data discussions, IMHO everyone agrees on that. It would be good to develop something that does not block the development towards linked data and is a first approach in that direction, but it will not be the final iteration of it.


Last Friday, four people invested more than two hours to collectively elaborate ideas about a data storage, or, as I understood it, several data storages for different applications.

When it comes to the database to store our POIs, as well as the different taxonomies that should merge with each other, can we assume that @almereyda would favour CouchDB & CartoDB, as these are the two you mentioned in your elaborate and in-depth post?


To the other programmers, @maxlath @pmackay @mattw @species: could you give feedback on the descriptions by @almereyda, who unfortunately could not make it to the call last week?


#18

hi!
My choice of CouchDB for inventaire was very much determined by my learning path, as explained in this issue, and, unfortunately, I have very limited experience with other databases. What I can say, then, is that the combination of CouchDB for master data and LevelDB (and its many extensions) for cache/secondary indexes has worked quite well for me so far and is very pleasant to work with. I just outsourced proximity queries to LevelDB thanks to the level-geospatial plugin, and could probably do some graph queries later thanks to the levelgraph plugin, or even go JSON-LD thanks to levelgraph-jsonld. (For more on this modular database approach, I recommend Rod Vagg’s talk Introducing LevelDB.)
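
For anyone unfamiliar with that setup, here is a condensed sketch of the pattern (master documents in CouchDB, a geohash-keyed secondary index in LevelDB); it uses nano, level and ngeohash rather than the exact plugins mentioned above, and the key layout is only an illustration:

```typescript
// Sketch of the "CouchDB as master, LevelDB as secondary geo index" pattern.
// Library choices and key layout are assumptions, not inventaire's actual code.
import nano from "nano";
import { Level } from "level";
import ngeohash from "ngeohash";

const couch = nano("http://localhost:5984");
const master = couch.db.use("pois");                       // master data in CouchDB
const geoIndex = new Level<string, string>("./geo-index"); // secondary index in LevelDB

// Write the document to CouchDB, then index its position under a geohash key.
async function savePoi(id: string, doc: { lat: number; lon: number; name: string }) {
  await master.insert({ _id: id, ...doc });
  const cell = ngeohash.encode(doc.lat, doc.lon, 7); // ~150 m cells
  await geoIndex.put(`${cell}:${id}`, id);
}

// Proximity lookup: prefix-scan the index, then fetch full documents from CouchDB.
// (A real implementation would also scan neighbouring geohash cells.)
async function poisNear(lat: number, lon: number) {
  const prefix = ngeohash.encode(lat, lon, 5); // coarser cell for searching
  const ids: string[] = [];
  for await (const [, id] of geoIndex.iterator({ gte: prefix, lt: `${prefix}\xff` })) {
    ids.push(id);
  }
  return Promise.all(ids.map((docId) => master.get(docId)));
}
```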

My sole frustration with CouchDB is the plugin system: I gave up on GeoCouch, as I had a lot of pain installing it properly and wouldn’t wish all contributors/self-hosters to have to go through that, but that’s where @almereyda’s love for containers could totally make sense :wink:


(Josef Kreitmayer) #19

@maxlath
thank you for your description, especially the description of your journey in the GitHub link! The Spotify link is interesting; it seems to definitely put PostgreSQL on the no-go list.

What was the pain, and did it work in the end? If not, why not?


@almereyda That might be an interesting conversation in Witzenhausen.
Or CartoDB?

@maxlath how do you see your possibilities of contributing to a database system other than CouchDB, e.g. CartoDB?


#20

The pain was that it isn’t very well integrated: you can’t just apt-get install geocouch, or do what I saw in the Elasticsearch world, where things can be as simple as elasticsearch/bin/plugin install mobz/elasticsearch-head. You have to build CouchDB from source with the right version (as it doesn’t work with all versions) and copy the GeoCouch files into some directory, with little guarantee that you are doing the right thing until you can see the final result.

I don’t know CartoDB but if other contributors know how to use it, I guess I could learn :slight_smile:


(Michael Maier) #21

Which link? If you talk about a link, please include it! I couldn’t find any reference to Spotify in @maxlath’s links.