Introduction

freetable.org is a database with an SQL like interface on the public Internet, with all the ownership, permissions, revision control, and abuse protections such an environment requires.

The goal of the FreeTable project is to create a public commons for data, in much the same way as Wikipedia provides a public commons for encyclopedic text. Rather than trying to carefully design the database tables that are supported, FreeTable allows programmers to create any table, and then allow use by other programmers to determine which tables are popular.

FreeTable seeks to be a repository of the important Facebook, Twitter, EBay, Craigslist, Match.com, and Yelp, like data sets, not just the West Valley Junior High baseball score spreadsheets, although we can support those too. We aren't fully there in terms of scalability today, but we will work aggressively in this direction. Contact us if you have any scalability concerns so we can work on addressing them.

Why now?

What FreeTable seeks to do is surprisingly feasible, even at Internet scale. A single Amazon EC2 c1.xlarge instance for example can support approximately 1,400 select requests per second. For comparison a top 10 website gets about 30m page views/day, or 350 page views a second. Thus depending on the work load 0.25 machines would be sufficient for the traffic from a top 10 website performing a FreeTable request on every page view. And the problem of scaling up from there partitions nicely. The front end processing, which takes the bulk of the time, can be spread across multiple machines. It is also possible to partition across database machines based on table name, or for really popular tables separate out multiple read only database machines from a writable master. Thus with only a handful of machines it is possible for FreeTable to operate at Internet scale. A c1.xlarge instance costs around $400/month, so the cost to support a top 10 website would be perhaps $100/month. The only caveat that is necessary is that, for FreeTable to successful at what it does, all of the data needs to fit in RAM, not disk. Disk is the new tape. And FreeTable has decided to sidestep the issues that come from attempting to optimize disk performance. With memory capacities of even laptops today exceeding 4GB, the range of useful datasets that FreeTable can handle is large and ever increasing.

Let's talk latency. FreeTable is currently located in the Amazon EC2 us-east-1 region. A majority of Internet servers are likewise located on the US East Coast. Today the cost of going to Amazon and back over the commercial Internet from a location on the US East Coast is perhaps 5ms. Add another 2-3ms for FreeTable to perform the lookup, and you have 8ms. This is the same or better than it would be to retrieve the data from a local disk, 10-15ms. Latency can never be too small, but the latency of using FreeTable is small enough. A warning though, from the US West Coast round trip latency might be 30-80ms. For certain interactive applications with servers located on the US West Coast this latency could be prohibitive. Similarly for FreeTable users with servers located overseas. The long term solution is to locate redundant FreeTable servers geographically, but this has yet to be done. A second warning, using HTTPS rather than HTTP to connect to FreeTable involves a significant compute time and a second round trip on connection establishment, so it is particularly important to minimize latency by using persistent connections when using HTTPS.

A quick tour

The easiest way to get started with FreeTable is to try it out. The FreeTable query language is called FQL, and it has an SQL inspired syntax.

The global.iana.tlds table is a well known table in FreeTable that stores information on the top level Internet domains. We will demonstrate how to perform the very basic query

    select * from global.iana.tlds where type="country-code" and purpose="Egypt"
on FreeTable in a variety of languages.

  • Web browser:

    Navigate to:

            http://db.freetable.org/ft?query=select+*+from+global.iana.tlds+where+type%3d"country-code"+and+purpose%3d"Egypt"
    
    This is really just a party trick, although it does show the workings of FreeTable up close. The resulting page may invoke a file download dialogue rather than displaying the results directly, This is because it has MIME content type application/json.
  • Unix shell:
        #!/bin/sh
        QUERY='select * from global.iana.tlds where type="country-code" and purpose="Egypt"'
        curl --data-urlencode query="$QUERY" http://db.freetable.org/ft
    
  • Python:
        #!/usr/bin/python
        import urllib, urllib2
        query = urllib.urlencode([('query', 'select * from global.iana.tlds where type="country-code" and purpose="Egypt"')])
        ft = urllib2.urlopen('http://db.freetable.org/ft', query)
        print ft.read()
    
  • PHP:
        #!/usr/bin/php
        <?
        $query = urlencode('select * from global.iana.tlds where type="country-code" and purpose="Egypt"');
        echo file_get_contents('http://db.freetable.org/ft?query=' . $query);
        ?>
    
  • Ruby:
        #!/usr/bin/ruby
        require 'net/http';
        require 'uri'
        query = URI.encode('select * from global.iana.tlds where type="country-code" and purpose="Egypt"')
        http = Net::HTTP.new('db.freetable.org')
        response, data = http.get('/ft?query=' + query)
        puts data
    

The output of these scripts will be JSON that in every case looks very close to the following:

    {"error": null, "result": [

    {"_id": 73, "_rev": 1, "type": "country-code", "purpose": "Egypt", "domain": "eg", ...},

    {"_id": 271, "_rev": 1, "type": "country_code", "purpose": "Egypt", "domain": "\u0645\u0635\u0631", ...},

    ], "stats": {"bytes_in": 72, "bytes_out": 522, "compute_units": 0.034163, "estimated_price": 0.000035279}}

The first thing to notice is the "result" value which shows Egypt has two top level domains, "eg" and a second internationalized domain name represented in JSON as escaped Unicode characters that denote Arabic code points.

Next notice the "_id" and "_rev" fields. "_id" is unique integer identifier for a particular row in the table that can be used as a handle on that row. "_rev" uniquely identifies revisions to a particular row, and allows versioning, roll back, and roll forward of a particular row's value.

Finally notice the performance statistics and estimated price for the query. The data in FreeTable is free as in speech, but not necessarily as in beer. Each user has a free quota of the amount of resources they can consume. To prevent abuse and make heavy users share in the cost of operating FreeTable usage above this free quota is billed. The "estimated_price" field shows the price for performing the query ignoring this free quota and any possible discounts.

The FreeTable Command Line Interface

FreeTable includes two Command Line Interfaces (CLI). A JavaScript based web CLI, and a Python based CLI. Both are very similar and provide for editing and history of FreeTable commands, and pretty printing of FreeTable results. The Python based CLI is slightly nicer, since it doesn't require switching back and forth between the keyboard and the mouse, but it requires installing the FreeTable client package available from the the developers page of freetable.org.

The web based CLI is available on freetable.org by following the link to "console".

The Python CLI can be invoked as:

    ./freetable
Commands can be entered into either CLI and terminated by semicolons:
    freetable> select * from global.iana.tlds where type="country-code" and purpose="Egypt";
And output similar to that seen previously is displayed.

Abuse model

To prevent abuse FreeTable requires users authenticate themselves before being allowed to perform certain FreeTable operations. This includes table creation, deletion, and meta-data update operations. Whether an anonymous user can insert, update, or delete data in an existing table depends on the table permissions set by the table curator. It is especially risky to allow anonymous users to update or delete data in a table. Therefore tables that do so are typically curated. A curated table is one in which updates and deletes are not made immediately visible, but first require the approval of the table curator.

Anonymous users can perform select operations on most tables.

Table names

Database table names in FreeTable, such as global.iana.tlds, have two parts. A tablespace, in this case "global", and a table name within that space, "iana.tlds". A normal user can only create tables in the global and the local tablespaces, where the local tablespace corresponds to the user's username. If no tablespace is specified, local is used.

Creating an account

So far we have only seen how to perform select queries that read and return data from FreeTable. To go further it is necessary to create a FreeTable account. A registered user is both able to perform operations that an anonymous user can't, and has a much bigger free quota than an anonymous user.

To create a FreeTable account navigate to freetable.org and click on the "create account" link. Fill out the requested information, being sure to select "API access", and submit the form, and you will end up with a username and password that is good for both accessing the FreeTable website, and programatically accessing the FreeTable database db.freetable.org.

Accessing FreeTable robustly

We have cheated a bit, some of the examples of how to access FreeTable shown previously work only for select statements. This is because for many of the examples we used HTTP GET requests, but for most queries FreeTable requires HTTP POST requests.

To access FreeTable more generally, it is recommended you make use of the appropriate FreeTable client library for your language. Libraries for a variety of languages can be found on the developers page of freetable.org.

In addition to using POST, the well written client libraries make use of HTTPS for password security, and persistent connections to reduce latency.

The examples that follow assume we chose the username "testuser" and the db access password "k3e9ijs2". Suppose we are going to be working in the future on a table called "ad".

Let's see how to use the client library, and to make things concrete let's use a query to delete any existing ad table. The client library will use a HTTPS connections so that the password isn't transmitted in the clear, although for some languages this may not be possible.

  • Python:
        #!/usr/bin/python
    
        import freetable
    
        user = 'testuser'
        password = 'k3e9ijs2'
    
        ft = freetable.FreeTableClient(user, password)
    
        result = ft.execute('drop table if exists ad')
            # OK to ignore result.
    
  • PHP:
        #!/usr/bin/php
        <?
    
        require_once('FreeTableClient.php');
    
        $user = 'testuser';
        $password = 'k3e9ijs2';
    
        $ft = new FreeTableClient($user, $password);
    
        $result = $ft->execute("drop table if exists ad");
            # OK to ignore $result.
    
        ?>
    
  • Ruby:
        #!/usr/bin/ruby
    
        require 'freetable'
    
        user = 'testuser'
        password = 'k3e9ijs2'
    
        ft = FreeTableClient.new(user, password)
    
        result = ft.execute("drop table if exists ad")
          # OK to ignore result.
    

Creating a table

Imagine we wish to create a service that syndicates classified ads much like those posted to Craigslist. Unlike Craigslist, which attempts to forbid the copying of ads, the ads in our FreeTable database would be free for anyone to redistribute and display, as well as contribute to. This example is intended to explore the capabilities of FreeTable, not to be a full blown classified ad syndication service.

The database will have a single table called testuser.ad. We could create the database using the following code. We will use Python for concreteness, but translation into another language would be trivial.

    #!/usr/bin/python

    import freetable

    user = 'testuser'
    password = 'k3e9ijs2'

    ft = freetable.FreeTableClient(user, password)

    result = ft.execute('''create table ad (
        location string not null comment "Location of the item",
        category string not null comment "Category of the item (cars, furniture, books)",
        price float not null comment "Price in local currency",
        title string not null storage json comment "Brief one line title",
        description string not null storage json comment "Description of the item",
        email_address email comment "Email contact address, or null if none"
    ) license_ids_required=[["Public_Domain"], ["CC-by-3.0"], ["CC-by-sa-3.0"], ["ODC-DbCL-1.0"]]
''')

The column data types in FreeTable correspond to those of JSON: boolean, integer, float, string (Unicode), array (list/tuple), object (hash/dict/associative array), and json (arbitrarily typed data); plus two special types: email (for securely storing email addresses), and secret (for securely storing passwords).

The json storage type declarations for title and description require a little explanation. Internally fields are potentially stored by FreeTable twice, first within a JSON string for the whole row, and second directly in memory in a native binary representation. The first form allows select statements to execute rapidly without having to unpack or repack JSON data. The second form allows the efficient implementation of where clauses, indexes, and aggregates. By default booleans, integers, floats, and strings are stored in both memory and as JSON string, while arrays, objects, and jsons are stored only as JSON strings. The storage json declaration above says to store the title and description strings only in the JSON. This will save RAM. Overriding the default storage type for columns should only be done for tables that are expected to become quite large, perhaps 100Mbytes or larger.

The license_ids_required field sets the allowed copyright license types for inserted row elements and will be discussed more later.

Uploading data

Insert can be used to upload data:
    import json

    listings = (
	{
	    'location'        : 'San Francisco',
	    'category'        : 'cars',
	    'price'           : 100.00,
	    'title'           : '1966 Dodge-veg-o-matic',
	    'description'     : 'Wheels fallen off.',
            'email_address'   : {'email': 'fred@internet.com', 'price': 0.01},
	    '_license_ids'    : ['Public_Domain']
	},
	{
	    'location'        : 'Liverpool',
	    'category'        : 'cars',
	    'price'           : 80.00,
	    'title'           : 'Mini Cooper',
	    'description'     : 'Car features genuine wood paneling.  Cheap cheap cheap.',
            'email_address'   : {'email': 'jane@internet.com', 'price': 0.01},
	    '_license_ids'    : ['Public_Domain']
	},
	{
	    'location'        : 'Adelaide',
	    'category'        : 'cars',
	    'price'           : 10.00,
	    'title'           : 'A real pig of a car',
	    'description'     : 'Used BMW.  Cheap cheap cheap.',
            'email_address'   : {'email': 'tom@internet.com', 'price': 0.01},
	    '_license_ids'    : ['CC-by-sa-3.0'],
            '_license_notices': u'Copyright \N{Copyright Sign} 1984 Eric Blair.'
	}
    )

    result = ft.execute('insert into ad values ?', listings)

The insert command takes a JSON value, not a list of SQL like tuples. The JSON value can either be expressed inline, or as in this case using "?" positional parameters which packages up the supplied data as a JSON string in a way that makes SQL injection attacks difficult. The reason for using JSON rather than an SQL tuple is to make it easier to interface the database to programming languages. FQL makes use of JSON throughout. It has two main advantages. It avoids the need to marshal and unmarshal arguments before and after insert/update and select statements. And it is possible to select an object, programatically modify it, and then update the object back into the table.

Also note we are able to insert multiple data values at once.

Lastly the _license_ids field is required. FreeTable takes copyright seriously and tries to ensure any data uploaded to it is free in the free speech sense. This field is a declaration by the programmer that the data is in the first two cases in the public domain, and in the final case licensed under a Creative Commons share-alike license. In the final case an appropriate copyright and attribution notice have been attached to the data element using the _license_notices field.

Now, let's look at some of the results from the last statement:

    print ft.last_insert_id
    print ft.message
Which gives:
    3
    Curator approval required for changes to take effect.
last_insert_id is the value of _id for the last element just inserted. It isn't very useful when inserting multiple values at once, but is more useful when inserting a single value.

As can be seen from the message above the table we created is curated. It is good to remember this possibility, or more generally it is good to check the ft.error and ft.message values if FreeTable seems to not be performing your query. In the case of the Python client library, it automatically checks and raises an exception if FreeTable returned an error response.

Curation

To see the effects of curation:
    result = ft.execute('select count(*) from ad')
    print result
which gives:
    [{'count': 0}]
The returned value is an array in case the select returns multiple rows, and each element is an object in case we select multiple columns. In any case the rows we inserted do not currently appear in the table.

To approve all changes to date and make them publicly visible:

    result = ft.execute('update hidden_columns ad all_rows set {"_current": true, "_review_state": "accepted"}')
    result = ft.execute('select count(*) from ad')
    print result
which gives:
    [{'count': 3}]
The details of this update command are a bit obscure, but the important thing to note is now all of the rows are visible. Normally, the FreeTable web interface will also be used to approve changes, hiding this complexity.

Rather than approve each change, a different way to approach things is to turn off curation, but when we do so in order to prevent abuse we want to make sure anonymous users can no longer insert or update the table:

    result = ft.execute('''update information_schema.tables
        set {"curated": false, "perm": "siu--usius--"} where table_name = "ad"''')
Now when an insert or update is performed it will have an immediate effect on the table. The permission value used "siu--usius--" specifies full rights to everyone except anonymous users.

Permissions

Permissions on tables are specified as a sequence of four triples, one each for the table curator, record contact, other non-anonymous users, and anonymous users. Each tuple indicates whether that user has select, insert, and update/delete permissions as "s", "i", and "u", or no such permission as "-".

No semantics are currently assigned to the contact for a record having select or insert permission on that record, so those values are always "-". Some common permissions are "------------" for no permissions granted, "siu---s--s--" for select for everyone and table curator modification only, "siu-usiu---" for select for everyone and full non-anonymous user modification, and "siu--usiusiu" for full access. Full access normally only makes sense for curated tables.

The email column type

The email address stored in a column of type email is never revealed. Instead, suppose we wish to enter into a dialog with the owner of the second car inserted:
    result = ft.execute('''select email_address from ad where _id=2''')
    print result
This gives:
    [
        {
            "email_address": {
                "count": 1, 
                "duration": 86400, 
                "price": 0.01
            }
        }
    ]
Indicating the email address is available, but to retrieve it will cost US$ 0.01. To retrieve the email address specify parameters to the column:
    result = ft.execute('''select email_address({"price": 0.011}) from ad where _id=2''')
    print result
where the price we specify is an upper bound on the price we are willing to pay, and prevents problems with the price changing between the time of the original query and the time we pay to retrieve the email address. If all goes well we now get: This gives:
    [
        {
            "email_address": {
                "count": 1, 
                "duration": 86400, 
                "forward": "ft-5biw-9rgy-bb5f@fwd.freetable.org", 
                "price": 0.01
            }
        }
    ]
We now have 86400 seconds in which to send up to 1 email to ft-5biw-9rgy-bb5f@fwd.freetable.org and have it forwarded to the car's owner.

The fee, $0.01, will added to the user's account. It should be chosen low enough to allow legitimate email, but high enough to make spamming unprofitable.

Updating data

We happen to know that the _id of the first record inserted is 1. We can update this record using:
    result = ft.execute('''update ad
        set {"price": 90, "description": "Three wheeler.  Selling this car cheap."} where _id = 1''')

There is another common idiom for updating records. Suppose we have previously retrieved a record using:

    result = ft.execute('select * from ad where _id = 1')
    listings = result
    assert(len(listings) == 1)
    listing = listings[0]
and have updated that record locally:
    listing['price'] = 80.00
we can now use the update command to simply push the complete value back to FreeTable:
    result = ft.execute('update ad set ? where _id = 1', listing)

Deleting data

To delete data from FreeTable use:
    result = ft.execute('delete from ad where price < 50')
This makes the data inaccessible to normal queries, but doesn't physically remove it from the system. The data can be reverted if need be.

To physically remove data from the system the purge command should be used. For instance to purge data one month or more old use:

    result = ft.execute('select current_time()')
    current_time = result[0]['current_time']
    result = ft.execute('purge from ad where _submit_time <= ?', current_time - 30 * 86400)

Purge only physically removes data that is inaccessible to normal queries. This includes curated revisions that are still pending review. Its use on a curated table should include a clause like where _review_state <> "pending". Data that is accessible to normal queries is left alone by purge.

Indexing the data

As things presently stand the table wouldn't be able to handle a large number of records, or a high query load. This is because it lacks indexes to make queries efficient. Let's add some:
    result = ft.execute('create index location_category_price on ad(location, category, price)')
    result = ft.execute('create index location_price on ad(location, price)')
    result = ft.execute('create index ft fulltext on ad(title, description)')

You want to add only those indexes needed by users of your table because unnecessary indexes consume memory and slow inserts and updates.

The last index is declared as a fulltext index. A fulltext index allows for within field searching of keywords as well as ranking of results. For instance:

    result = ft.execute('select * from ad where fulltext("cheap car")')
    for row in result:
        print row['title']
displays:
    Mini Cooper
    1966 Dodge-veg-o-matic

One word of warning when using fulltext indexes. There can be up to 10 seconds of latency between the time an operation is performed and the time it is reflected in the index. For most applications of fulltext indexes this shouldn't be a problem, but it can throw you off when you are testing things out.

Documentation and licensing

Since FreeTable is designed for the sharing of data, documenting for others the purpose of the table and the meaning of each column is vital. We have already provided comments for each of the columns when we created them, let us now provide documentation and licensing information for the table as a whole.
    result = ft.execute('''update information_schema.tables set {
        "advertise": true,
        "table_title": "Global public classified ads",
        "table_comment": "This table accepts listings from anyone.",
        "license_ids": ["ODC-ODbL-1.0"],
        "license_ids_required": [["Public_Domain"], ["CC-by-3.0"], ["CC-by-sa-3.0"], ["ODC-DbCL-1.0"]],
        "license_notices": "Copyright 2011 Eric Blair"
    } where table_name = "ad"''')

With a little foresight we could also have specified these parameters when we first created the table.

    result = ft.execute('''create table ad ( ... )
        advertise=true,
        table_title="Global public classified ads",
        table_comment="This table accepts listings from anyone.",
        license_ids=["ODC-ODbL-1.0"],
        license_ids_required=[["Public_Domain"], ["CC-by-3.0"], ["CC-by-sa-3.0"], ["ODC-DbCL-1.0"]],
        license_notices="Copyright 2011 Eric Blair"
    ''')

In either case setting advertise to true marks the table for publication in FreeTable's list of publicized tables. The table will then be able to be found using the search form located on the freetable.org home page.

Through a quirk in the law, copyright in databases and tables is treated differently than the copyright in data elements. FreeTable suggests utilizing a share-alike, GPL-like, license, for at least the table. This can be as shown above utilizing the Open Data Commons Open Database License. The license_ids field specifies the licensing terms for the table as a whole, while the license_ids_required field specifies what sorts of data elements can be included in the table. The constraint specified only applies to new elements added to the table.

If you favor a less restrictive attribution only, BSD-style, license, you could use something like:

        ...
        "license_ids": ["ODC-By-1.0"],
        "license_ids_required": [["Public_Domain"], ["CC-by-3.0"], ["ODC-DbCL-1.0"]],
        ...

In real life

This example has been given using a table in your local namespace. Others can access this table using your username, as user.testuser.ad, but if you are really going to make this table available, it would probably be better to give it a name in the global namespace, as global.ad. This can be done either when you create the table, or later using the rename table command.

Good table design

Good table design has a number of facets:
  • excluding index fields storing data only once
  • appropriate selection of table licensing terms
  • careful choice of table and column names
  • provision of documentation through the table title, table comment, and column comments
  • careful selection of indexes
  • use of default values for newly added fields of an existing table
  • not deleting existing fields of an existing table
  • use of JSON storage for large blocks of text

Limitations of FreeTable

To ease implementation, and to make it easier to re-architect the system in the future, a number of constraints are imposed by FreeTable.
  • where clauses are restricted to the form: <term_1> and <term_2> and <term_3> ... and terms are typically limited to the form <column_name> <op> <json_literal>
  • FreeTable queries are executed atomically, but there is no way to string multiple queries together and have them form a single transaction
  • also, as espoused by the NoSQL movement, joins are not supported

Advantages of FreeTable

  • able to share dynamic data
  • able to access and modify single rows
  • web interface
  • command line interface
  • FQL: a mixture of JSON and SQL
  • a Wiki like data model
  • roll forward/back history
  • data curation and the ability to either manually approve or automatically accept data changes
  • local and global table namespaces
  • fulltext search
  • low latency in memory data storage
  • load balanced HTTP transport
  • redundant hardware and monitoring
  • failover to read only slave mode
  • able to rapidly scale both logically and physically
  • data as a service business model

Under the hood

FreeTable runs at Amazon's EC2 on Ubuntu Linux machines. A load balancer distributes traffic across the FreeTable frontend. Home grown Python code is then used to parse and determine how to execute the user's query. The query is dispatched to the appropriate database machine containing the requested table. MySQL is currently used as the backing database to play the resulting queries against. Solr is used to perform fulltext search.

Cleaning up

It is important to cleanup after working your way through this tutorial:
    result = ft.execute('drop table ad')

The current FreeTable implementation stores all tables in memory, not on disk, and as a result storage is a critical resource. To prevent abuse, FreeTable intends to bill in the future for tables that are not actively being accessed.

To make sure you have cleaned up run

    show tables
either using ft.execute, or from the FreeTable CLI. It will list any tables that still exist in your namespace.

Where to go next

The FreeTable Developer Specification is a great place to go and learn more about the details of FreeTable including many commands not covered here.

Copyright (C) 2010 Gordon Irlam. Creative Commons Attribution-Share Alike 3.0 United States License. http://creativecommons.org/licenses/by-sa/3.0/us/