Free, Public Data Sets (jacquesmattheij.com)
273 points by iisbum on Feb 1, 2011 | 51 comments



If anyone is looking for more datasets, see:

http://datasets.reddit.com

http://opendata.reddit.com

and

http://www.quora.com/Where-can-I-get-large-datasets-open-to-...

for some good lists of available stuff.


get.theinfo is the best way to find data sets. They are a bunch of data hoarders who can help you: http://groups.google.com/group/get-theinfo/?pli=1

I always ask there if I can't find what I'm looking for.

Here are more data sets. These are general data sets. Email me if you have a specific kind of data set in mind (e.g. web-as-corpus, spam, images, social, reviews, etc.). I have a big file of information.

    http://theinfo.org/
    http://infochimps.org/datasets
    http://ckan.org [Comprehensive Knowledge Archive Network]
    http://www.datawrangling.com/some-datasets-available-on-the-web.html
    http://del.icio.us/pskomoroch/dataset
    http://www.reddit.com/r/datasets/
    http://news.ycombinator.com/item?id=1242029
    http://www.reddit.com/r/opendata
    http://www.trustlet.org/wiki/Repositories_of_datasets
    http://www.daniel-lemire.com/blog/data-for-data-mining/
    http://www.quantlet.org/mdbase/
    http://datamob.org/
    http://freebase.com/
    http://infochimp.info/ics/data/ripd/www-personal.umich.edu/~mejn/netdata/
    http://www.archive-it.org/public/all_collections

    Large:
        http://www.ckan.net/tag/read/size-large
        http://www.diggingintodata.org/Repositories/tabid/167/Default.aspx
Web as corpus:

    Good instructions:
        http://corpus.leeds.ac.uk/internet.html#description
    http://sslmit.unibo.it/~baroni/bootcat.html
    http://www.drni.de/wac-tk/index.php/Documentation

etc. Email me if you need more:

    http://cleaneval.sigwac.org.uk/
    http://liste.sslmit.unibo.it/pipermail/sigwac/2007-November/...
    http://wacky.sslmit.unibo.it/doku.php?id=
    http://clic.cimec.unitn.it/marco/research.html



The wikipedia dump is great, but I've started using http://wiki.dbpedia.org/ which has an API to query the dumps.
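
A minimal sketch of what querying DBpedia can look like, assuming its public SPARQL endpoint at dbpedia.org/sparql and a `format` parameter for JSON results (neither is stated in the comment above). The query itself is a made-up example, and the request is only built, not sent, so the snippet runs offline.

```python
from urllib.parse import urlencode

# Assumption: DBpedia's public SPARQL endpoint; not named in the comment above.
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"


def build_sparql_url(query: str, endpoint: str = DBPEDIA_ENDPOINT) -> str:
    """Return a GET URL that asks the endpoint for JSON-formatted results."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return endpoint + "?" + urlencode(params)


# Hypothetical example query: ten resources typed as dbo:City.
QUERY = "SELECT ?city WHERE { ?city a <http://dbpedia.org/ontology/City> } LIMIT 10"

url = build_sparql_url(QUERY)
# urllib.request.urlopen(url) would perform the actual request.
```

Keeping URL construction separate from the network call makes it easy to inspect or log the query before sending it.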

Thanks for these, iisbum. I wish more public data was available in db, xml or similar structures - too often I find myself scraping government sites or PDFs to get the tables I need.
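
For the HTML half of that scraping chore, here is a rough sketch using only the Python standard library. The `TableExtractor` class and the sample table are hypothetical, and it assumes simple, non-nested tables; real government pages are often messier.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # row currently being read, or None
        self._cell = None  # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up sample standing in for a scraped page.
SAMPLE = """
<table>
  <tr><th>Year</th><th>Budget</th></tr>
  <tr><td>2009</td><td>1,200</td></tr>
  <tr><td>2010</td><td>1,350</td></tr>
</table>
"""

parser = TableExtractor()
parser.feed(SAMPLE)
rows = parser.rows
```

The resulting `rows` list can be handed straight to `csv.writer` to get the structured file the site never published.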


We've got quite a lot of public economic data: http://timetric.com/.

If you're up to something in the economic data space we'd love to talk. Happy to take this to email (andrew@timetric.com) if anyone's interested.


I looked at the site, and I see some data but I didn't find what I would have hoped for. I couldn't find yield curves, and historical exchange rates up to today (available on the ECB site in XML format). Certainly I would have thought yield curves were a front page item.
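
As a rough illustration of reading exchange rates from XML of that kind: the sketch below assumes the feed nests `Cube` elements carrying `time`, `currency` and `rate` attributes, which is how I recall the ECB reference-rate files looking. The sample document, its namespaces and the rates in it are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like an ECB reference-rate file;
# namespaces and rate values are assumptions for illustration only.
SAMPLE = """<gesmes:Envelope xmlns:gesmes="http://www.gesmes.org/xml/2002-08-01"
                 xmlns="http://www.ecb.int/vocabulary/2002-08-01/eurofxref">
  <Cube>
    <Cube time="2011-02-01">
      <Cube currency="USD" rate="1.3803"/>
      <Cube currency="GBP" rate="0.8560"/>
    </Cube>
  </Cube>
</gesmes:Envelope>"""


def parse_rates(xml_text):
    """Return {date: {currency: rate}} from nested Cube elements."""
    root = ET.fromstring(xml_text)
    rates = {}
    for elem in root.iter():
        # Match on the tag suffix so the default namespace does not matter.
        if elem.tag.endswith("Cube") and "time" in elem.attrib:
            day = rates.setdefault(elem.attrib["time"], {})
            for child in elem:
                if "currency" in child.attrib:
                    day[child.attrib["currency"]] = float(child.attrib["rate"])
    return rates


rates = parse_rates(SAMPLE)
```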

Things that would be very cool would be 1. financial statements in a database format. I know you can scrape this but I don't know if they are available legitimately? 2. Historical implied volatilities and historical observed volatilities.
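
Observed (realized) volatility can at least be derived from any price history you already have. A minimal sketch, assuming daily closing prices and the common convention of annualizing with 252 trading days; the prices are invented, and implied volatility would still need option data.

```python
import math
import statistics


def realized_volatility(prices, periods_per_year=252):
    """Annualized sample standard deviation of log returns."""
    if len(prices) < 3:
        raise ValueError("need at least three prices")
    returns = [math.log(b / a) for a, b in zip(prices, prices[1:])]
    return statistics.stdev(returns) * math.sqrt(periods_per_year)


PRICES = [100.0, 101.0, 99.5, 100.5, 102.0]  # made-up daily closes
vol = realized_volatility(PRICES)
```

A series that grows by a constant factor each day has identical log returns, so its realized volatility comes out as (numerically) zero, which is a handy sanity check.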


http://timetric.com/dataset/exchange_rates_forex_europe/ for the exchange-rate data, at least.


Okay - it's there...

Is it your site? Are you going to add yield curves?



Heh, a day after he leaves HN he makes the first page. He will still be here whether he visits the site or not.


And I recently discovered Google Refine, for cleaning up messy datasets.

http://code.google.com/p/google-refine/
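
One kind of cleanup such tools help with is spotting values that differ only in case, punctuation, spacing or word order. The sketch below is a plain key-collision approach in Python, not Refine's own code; `fingerprint`, `cluster` and the sample names are hypothetical.

```python
import string
from collections import defaultdict


def fingerprint(value):
    """Normalize a string so trivially different spellings collide."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(cleaned.split()))  # dedupe and order the words
    return " ".join(tokens)


def cluster(values):
    """Group values sharing a fingerprint; keep groups with real variants."""
    groups = defaultdict(list)
    for v in values:
        groups[fingerprint(v)].append(v)
    return [g for g in groups.values() if len(set(g)) > 1]


NAMES = ["Acme Inc.", "acme inc", "Inc. Acme", "Widget Co", "Widget  Co", "Globex"]
clusters = cluster(NAMES)
```

Each cluster is a candidate set of spellings to merge; a person should still review them, since distinct entities can share a fingerprint.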



What about http://ckan.org/ ?

The Comprehensive Knowledge Archive Network! Pretty sweet resource really.


The CKAN software is a platform for hosting data and metadata, but as far as I see, http://ckan.org does not actually list data sets.


try http://ckan.net for the data, http://ckan.org is for the software behind it :)


Kinda surprised no one has mentioned Factual. I'm using some of their diabetes data for my side-startup.

http://www.factual.com/


They write that most of the data is available for download. I can't find it anywhere though, only the various APIs. Have they removed the possibility of downloading the data?


Don't forget Stack Overflow! http://data.stackexchange.com/


There is also the IMDB database in various formats provided by IMDB itself here: http://www.imdb.com/interfaces

Edit: Although the use of this database is not free, I believe it is just fine to download and experiment with for personal use...


Non-Free Google data:

http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=...

This data set, contributed by Google Inc., contains English word n-grams and their observed frequency counts.
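
For anyone unsure what "n-grams and their observed frequency counts" means in practice, here is a toy sketch that counts them over a made-up sentence. It assumes whitespace tokenization and lowercasing, which is far cruder than whatever processing the real data set used.

```python
from collections import Counter


def ngram_counts(text, n):
    """Count each run of n consecutive whitespace-separated tokens."""
    tokens = text.lower().split()
    # zip over n shifted copies of the token list yields the n-grams.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams)


TEXT = "the cat sat on the mat and the cat slept"  # made-up sample
bigrams = ngram_counts(TEXT, 2)
```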


Don't forget the Lahman Baseball Database with information from 1871-2010

http://baseball1.com/statistics/


And, for very detailed play-by-play data for decades of games, check out retrosheet: http://www.retrosheet.org/game.htm


https://datamarket.azure.com/

Some free, some paid.


http://infochimps.com also has a bunch.


For those interested in transit data, check out the GTFS Data Exchange, a directory of many agencies' scheduling and map data, following the Google Transit Feed Specification.

http://www.gtfs-data-exchange.com/
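
GTFS feeds are zip archives of CSV text files, so they are easy to read with standard tools. A minimal sketch that loads a `stops.txt`-style file, using the `stop_id`, `stop_name`, `stop_lat` and `stop_lon` columns from the specification; the two stops below are invented sample data.

```python
import csv
import io

# Made-up rows in the shape of a GTFS stops.txt file.
SAMPLE_STOPS = """stop_id,stop_name,stop_lat,stop_lon
S1,Main St & 1st Ave,47.6097,-122.3331
S2,Central Station,47.5984,-122.3300
"""


def load_stops(fileobj):
    """Return {stop_id: (name, latitude, longitude)} from a stops.txt file."""
    stops = {}
    for row in csv.DictReader(fileobj):
        stops[row["stop_id"]] = (
            row["stop_name"],
            float(row["stop_lat"]),
            float(row["stop_lon"]),
        )
    return stops


stops = load_stops(io.StringIO(SAMPLE_STOPS))
```

With a real feed, the same function should work on a file opened from the unzipped archive.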


Open Directory RDF Dump: http://rdf.dmoz.org/


For international relations data, Correlates of War hosts a number of data sets: http://www.correlatesofwar.org/Datasets.htm


I have links to a few govt.-provided data sets at http://elev.at


United Nations stats (lots of goodies)

http://unstats.un.org

some free, some paid

http://infochimps.com/

AIS Data (Marine Traffic)

http://www.aishub.net/

http://www.marinetraffic.com/ais/

And there's a great list of sources on Quora

http://www.quora.com/Where-can-I-get-large-datasets-open-to-...




I track datasets that I come across at http://www.delicious.com/tobym/dataset



http://www.naturalearthdata.com/ From the website: Natural Earth is a public domain map dataset available at 1:10m, 1:50m, and 1:110 million scales as tightly integrated vector and raster data ...


Does anybody have precinct-level election results for the USA? A set for recent elections would be great for public access redistricting apps that will become relevant this year.


anyone know of a dataset that has dates for when companies were registered or announced in the news? For example I would like to see the date Hacker News was launched.


i've had trouble finding geographical boundaries on neighborhoods in U.S. cities (e.g. downtown areas and residential neighborhoods). anyone know where i can find this?


It's not exactly neighborhoods, but the US Census TIGER database has block and blockgroup boundaries with associated demographic data. You could probably synthesize that into "neighborhood" definitions. http://www.census.gov/geo/www/tiger/tgrshp2010/tgrshp2010.ht...



It's not in dump format, but you should take a look at simplegeo's (free) api: http://simplegeo.com/


Some US Gov't data sites no one else mentioned:

http://data.govloop.com/ has data and lots of pointers to local government data.

Also I'm surprised no one mentioned Carl Malamud's site: http://public.resource.org/ - Lots of US gov't and legal data in friendly formats.


Heaps of useful info: http://www.nationmaster.com


I'd prepared (based on other datasets) a smallish movie tweet dataset. You may find it useful, if working with tweets and/or reviews.

https://github.com/mohitranka/TwitterSentimentCorpora


CIA World Factbook (demographics, geography, communications, government, economy, military stats of countries):

https://www.cia.gov/library/publications/download/


Thank you all for posting links and links to links to datasets, I have an unrelenting interest in data aggregation and machine learning, and didn't even know where to start. So helpful, and I am no longer stuck. :)


do all of them have some uniform API? that would be great, ideally. query and cache all of them on demand from your own app without additional programming.

bookmarked and shared this thread.
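
No common API exists across these sources as far as this thread shows, but the query-and-cache idea can be sketched as a thin wrapper. Everything here is hypothetical: `CachedSource`, the `fetch` callable you would write per data source, and the one-hour default expiry.

```python
import time


class CachedSource:
    """Wrap a per-source fetch function behind one query() call with a TTL cache."""

    def __init__(self, fetch, ttl_seconds=3600, clock=time.monotonic):
        self._fetch = fetch        # callable: key -> data (source-specific)
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}           # key -> (expires_at, value)

    def query(self, key):
        now = self._clock()
        hit = self._cache.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]          # still fresh, skip the fetch
        value = self._fetch(key)
        self._cache[key] = (now + self._ttl, value)
        return value


calls = []


def fake_fetch(key):
    """Stand-in for a real download; records each call."""
    calls.append(key)
    return {"dataset": key, "rows": []}


source = CachedSource(fake_fetch, ttl_seconds=60)
first = source.query("exchange_rates")
second = source.query("exchange_rates")  # served from the cache
```

The per-source `fetch` functions are where the "additional programming" still lives; the wrapper only gives the app one consistent way to call them.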


I'm looking for free, public domain, large high-resolution imaging datasets.

Something like satellite imagery, medical imaging, semiconductor mask and wafer photos or CAD files, etc.

Any pointers?


Here are medical imaging datasets I am aware of:

Neuroimaging (see http://www.nitrc.org for others):

    OASIS http://www.oasis-brains.org/
    ADNI http://adni.loni.ucla.edu/ (huge dataset, requires application)
    OpenfMRI http://openfmri.org/
    EEG http://eeg.pl/epi

Some other applications, for example CT colonography: http://www.acrin.org/


This is a real treasure to come across. I hope we'll keep seeing jacquesm's blog postings here.

Anyone know of any publicly available song lyric databases?


Anyplace I can find a _small_ free web spam dataset? ( for commercial use, sorry :( )

All the datasets I found on the web are huge (in double-digit GBs).


Wow, useful stuff. This thread goes into my bookmarks.



