Datasets for Research or Teaching Projects

Comprehensive Collections (for multiple purposes)
  • ForeverData™ is a permanent online archive of temporal data for researchers. Data collected from specific websites over time and data acquired on a regular basis can find a permanent home here, with its contents updated over time and access made available through an online database for subsequent searching and access.
  • Ten most popular datasets on the US government portal
    Datasets for correlational analysis
    • Google Correlate showcases Google search data that correlate with real-world data
    Datasets for social network analysis
    Datasets for machine learning
    • Kaggle hosts problems free of charge (as in free lunch) for hundreds of universities around the globe, giving students an opportunity to apply machine learning to real problems.
    Corpus datasets for statistics that can be used for text analytics
    • The Brown Corpus was the first computer-readable general corpus of texts prepared for linguistic research on modern English. It was compiled by W. Nelson Francis and Henry Kučera at Brown University in the 1960s and contains over 1 million words (500 samples of 2000+ words each) of running text of edited English prose printed in the United States during the calendar year 1961.
    • Department of Homeland Security's list of suspicious words for cyber attacks
    Marketing
    • Google Trends provides Google search data of any search phrase of your choice over time (in a time series format)
    • Google Correlate showcases Google search data that correlate with real-world data
    Finance
    Accounting
    Economics
    • The Harvard Dataverse Network is free* and open to all researchers worldwide to share, cite, reuse and archive research data.
    • Real Personal Income for State and Metropolitan Areas, 2008-2012 by Bureau of Economic Analysis
    • GDP per capita
    • GDP growth rate
    • Inflation rates
    • Population density
    • Statistics of US Business (SUSB) (formerly Business Information Tracking Series (BITS))
    • Survey of Income and Program Participation (SIPP) is the premier source of information for income and program participation. SIPP collects data and measures change for many topics including: economic well-being, family dynamics, education, assets, health insurance, childcare, and food security.
    • The Current Population Survey (CPS), sponsored jointly by the U.S. Census Bureau and the U.S. Bureau of Labor Statistics (BLS), is the primary source of labor force statistics for the population of the United States. The CPS is the source of numerous high-profile economic statistics, including the national unemployment rate, and provides data on a wide range of issues relating to employment and earnings. The CPS also collects extensive demographic data that complement and enhance our understanding of labor market conditions in the nation overall, among many different population groups, in the states and in substate areas. 
    • Survey of Business Owners by US Census 
    • The Longitudinal Employer-Household Dynamics (LEHD) program is part of the Center for Economic Studies at the U.S. Census Bureau. The LEHD program produces new, cost-effective, public-use information combining federal, state and Census Bureau data on employers and employees under the Local Employment Dynamics (LED) Partnership. State and local authorities increasingly need detailed local information about their economies to make informed decisions. The LED Partnership works to fill critical data gaps and provide indicators needed by state and local authorities.
    • County Business Patterns (CBP) is an annual series that provides subnational economic data by industry. This series includes the number of establishments, employment during the week of March 12, first quarter payroll, and annual payroll. This data is useful for studying the economic activity of small areas; analyzing economic changes over time; and as a benchmark for other statistical series, surveys, and databases between economic censuses. Businesses use the data for analyzing market potential, measuring the effectiveness of sales and advertising programs, setting sales quotas, and developing budgets. Government agencies use the data for administration and planning. 
    • Business Employment Dynamics is a set of statistics generated from the Quarterly Census of Employment and Wages program. These quarterly data series consist of gross job gains and gross job losses statistics from 1992 forward. These data help to provide a picture of the dynamic state of the labor market. 
    • The Job Openings and Labor Turnover Survey (JOLTS) program produces data on job openings, hires, and separations.
    • The National Longitudinal Surveys (NLS) are a set of surveys designed to gather information at multiple points in time on the labor market activities and other significant life events of several groups of men and women. For more than four decades, NLS data have served as an important tool for economists, sociologists, and other researchers.
    • The United Nations Databases
    • The Kauffman Index is a report from the Kauffman Foundation bringing together the latest data available on entrepreneurial trends nationally, at the state level, and for the 40 largest metropolitan areas of the United States.
    Entrepreneurship
    • The Kauffman Index is a report from the Kauffman Foundation bringing together the latest data available on entrepreneurial trends nationally, at the state level, and for the 40 largest metropolitan areas of the United States.
    Global
    Information Systems
    • Digital Evolution Index, created by Mastercard and Tufts University, is designed to map countries’ journeys, measure the rate of change in digital evolution across the globe, and provide actionable, data-informed insights for your business
    • ICT Development Index, created by the United Nations International Telecommunication Union, is designed to benchmark each country's ICT development, based on internationally agreed information and communication technologies (ICT) indicators. It is also a standard tool for measuring the digital divide.
    • DataLossDB is a database of data breach incidents
    • Department of Homeland Security's list of suspicious words for cyber attacks 
    • Google Correlate showcases Google search data that correlate with real-world data
    • Google Trends provides Google search data of any search phrase of your choice over time (in a time series format) 
    Government/Public Sector
    • Pittsburgh Data Forum is a platform for Pittsburghers to tell City government what information you’d like to see and why. We want to make sure that the data we share is useful and interesting to citizens, so please tell us your story! In early 2014, the City of Pittsburgh passed Open Data legislation setting the City’s default to open and committing to work with the public to develop a portal through which anyone can access data collected and/or maintained by City Government.
    • Pittsburgh LocalData Collaborative (PLDC) is a collaborative effort dedicated to making real-time decisions that improve our communities based on accurate and timely data that is open and accessible to everyone.
    • Data.gov, the central site for U.S. Government data, is an important part of the Administration’s overall effort to open government
    • City of Pittsburgh budget data
    • Southwestern Pennsylvania (SWPA) Community Profiles Data System: A new way to collect, analyze, and understand information across a range of domains to look at our neighborhoods and communities in a comprehensive data fashion. SWPA Community Profiles presents community data and indicators in a series of interactive tables and maps. With data and indicators from local, state, and federal government sources, along with a select set of other databases, SWPA Community Profiles will allow users to understand and visualize data along a range of geographic areas in our communities and region. Data are organized along 11 data domains: arts and culture, demographics, economy, education, environment, governance and civic vitality, health, housing and properties, human services, public safety, and transportation. 
    • The Western Pennsylvania Regional Data Center supports key community initiatives by making public information easier to find and use. The Data Center provides a technological and legal infrastructure for data sharing to support a growing ecosystem of data providers and data users. The Data Center maintains Allegheny County and the City of Pittsburgh’s open data portal, and provides a number of services to data publishers and users. The Data Center also hosts datasets from these and other public sector agencies, academic institutions, and non-profit organizations. The Data Center is managed by the University of Pittsburgh’s Center for Social and Urban Research, and is a partnership of the University, Allegheny County and the City of Pittsburgh.
    • Medicare payments and claims datasets are available for downloading from Centers for Medicare and Medicaid Services
    Financial Forensics
    Women
