Below are the courses that will be available to QMSS students during Fall Semester 2017. Course information will continue to be updated as it becomes available. If you see discrepancies between this list and the Columbia Directory of Classes or Vergil, you should default to the details on this page.
Quantitative Theory and Methodology
Michael Parrot, Marco Morales, TBD
This interdisciplinary course, taken in the fall semester, is a comprehensive introduction to quantitative research in the social sciences. The course focuses on foundational ideas of social science research, including strengths and weaknesses of different research designs, interpretation of data drawn from contemporary and historical contexts, and strategies for evaluating evidence. Most of the course consists of two-week units, each examining a particular research design through a set of scholarly articles that use that design. Topics include: the “science” of social science and the role of statistical models, causality and causal inference, concepts and measurement, understanding human decision making, randomization and experimental methods, observation and quasi-experimentation, sampling, survey research, and working with archival data.
Data Analysis for the Social Sciences
This course is meant to provide an introduction to probability and social statistics, tailored to the types of analyses and data issues encountered by QMSS students. The chief goal is to help students generate and interpret quantitative data in helpful and provocative ways. The hope is that by trying to measure the social world, students will see their thinking become clearer and their understandings of concepts grow more complex. They will also become competent at reading statistical results in social science publications and in other media. Only basic mathematics skills are assumed, but it is hoped that students will become more facile with numbers, functions and their relationships. Another important goal of the course is to teach students how to manipulate and analyze data themselves using statistical software. We will focus mainly on the program R. There will be an optional lab section every other week, which will be devoted to using these software programs to practice commands and to develop a paper using the General Social Survey, World Values Survey or another dataset of the student’s choosing.
Time Series, Panel Data, and Forecasting
This course will introduce students to the main concepts and methods behind regression analysis of temporal processes and highlight the benefits and limitations of using temporally ordered data. Students study the complementary areas of time series data and longitudinal (or panel) data. There are no formal prerequisites for the course, but a solid understanding of the mechanics and interpretation of OLS regression will be assumed (we will briefly review it at the beginning of the course). Topics to be covered include regression with panel data, probit and logit regression of pooled cross-sectional data, difference-in-difference models, time series regression, dynamic causal effects, vector autoregressions, cointegration, and GARCH models. Statistical computing will be carried out in R.
GR5021 & GR5022
This course has two goals. One, it is designed to expose students in the QMSS degree program to different methods and practices of social science research. Seminar presentations are given on a wide range of topics by faculty from Columbia and other New York City universities, as well as researchers from other settings. Two, it is also designed to give students important professional development skills, particularly around academic writing, research methods and job skills.
VIEW PREVIOUS SYLLABUS HERE (NOTE: Speakers will differ from last semester)
Practicum in Data Analysis
This practicum course is meant to offer valuable training to students. Specifically, this practicum will mimic the typical conditions that students would face in an internship in a large data-intense institution. The practicum will focus on four core elements involved in most internships: (1) developing the intuition and skills to properly scope ambiguous project ideas; (2) practicing organizing and accessing a variety of large-scale data sources and formats; (3) conducting basic and advanced analysis of big data; and (4) communicating and “productizing” results and findings from the earlier steps, in things like dashboards, reports, interactive graphics, or apps. The practicum will also give students time to reflect on their work, and how it would best translate into corporate, non-profit, start-up and other contexts.
The class is roughly divided into two parts: (1) programming best practices, exploratory data analysis (EDA), and unsupervised learning; and (2) supervised learning, including regression and classification methods. In the first part of the course we will focus on writing R programs in the context of simulations, data wrangling, and EDA. Unsupervised learning addresses problems where the outcome variable is not known and the goal of the analysis is to find hidden structure in data, such as market segments from buying patterns or human population structure from genetics data. Supervised learning deals with prediction problems where the outcome variable is known, such as predicting the price of a house in a certain neighborhood or the outcome of a congressional race.
Natural Language Processing
Wayne Tai Lee
Social scientists need to engage with natural language processing (NLP) approaches that are found in computer science, engineering, AI, tech and in industry. This course will provide an overview of natural language processing as it is applied in a number of domains. The goal is to gain familiarity with a number of critical topics and techniques that use text as data, and then to see how those NLP techniques can be used to produce social science research and insights. This course will be hands-on, with several large-scale exercises. The course will start with an introduction to Python and associated key NLP packages and GitHub. The course will then cover topics like language modeling; part of speech tagging; parsing; information extraction; tokenizing; topic modeling; machine translation; sentiment analysis; summarization; supervised machine learning; and hidden Markov models. Prerequisites are basic probability and statistics, basic linear algebra and calculus. The course will use Python, and so if students have programmed in at least one software language, that will make it easier to keep up with the course.
GIS and Spatial Analysis
This course introduces students to basic spatial analytic skills. It covers introductory concepts and tools in Geographic Information Systems (GIS) and database management. The course also introduces students to the process of developing and writing an original spatial research project. Topics to be covered include: social theories involving space, place and reflexive relationships; social demography concepts and databases; visualizing social data using geographic information systems; exploratory spatial data analysis of social data; spatially weighted regression models; spatial regression models of social data; and space-time models. Use of open-source software (primarily the R software package) will be taught as well.
Modern Data Structures
This course is intended to provide a detailed tour of how to access, clean, “munge” and organize data, both big and small. (It should also give students a flavor of what would be expected of them in a typical data science interview.) Each week will have simple, moderate and complex examples in class, with code to follow. Students will then practice additional exercises at home. The end point of each project would be to get the data organized and cleaned enough so that it is in a data-frame, ready for subsequent analysis and graphing. Therefore, no analysis or visualization (beyond just basic tables and plots to make sure everything was correctly organized) will be taught; and this will free up substantial time for the “nitty-gritty” of all of this data wrangling.
Machine Learning for Social Science
This course will provide a comprehensive overview of machine learning as it is applied in a number of domains. Comparisons and contrasts will be drawn between this machine learning approach and more traditional regression-based approaches used in the social sciences. Emphasis will also be placed on opportunities to synthesize these two approaches. The course will start with an introduction to Python, the scikit-learn package and GitHub. After that, there will be some discussion of data exploration, visualization in matplotlib, preprocessing, feature engineering, variable imputation, and feature selection. Supervised learning methods will be considered, including OLS models, linear models for classification, support vector machines, decision trees, random forests, and gradient boosting. Calibration, model evaluation and strategies for dealing with imbalanced datasets, non-negative matrix factorization, and outlier detection will be considered next. This will be followed by unsupervised techniques: PCA, discriminant analysis, manifold learning, clustering, mixture models, and cluster evaluation. Lastly, we will consider neural networks, convolutional neural networks for image classification and recurrent neural networks. This course will primarily use Python. Previous programming experience will be helpful but not requisite. Prerequisites: basic probability and statistics, basic linear algebra, and calculus.
This course is designed to help you make consistent progress on your master’s thesis throughout the semester, as well as to provide structure during the writing process. The master’s thesis, upon completion, should answer a fundamental research question in the subject matter of your choice. It should be an academic paper based on data that you can acquire, clean, and analyze within a single semester, with an emphasis on clarity and policy relevance. Remember that your thesis is not designed to be the crowning achievement of your career. If you find that the scale of your topic is too great, please choose a limited number of research questions to explore for the master’s thesis. Keep in mind that your time is limited! Early semester homework: Selecting a topic of interest is often the most difficult part of writing an academic paper, but deciding on the data you will be using is a significant step towards completing a satisfactory dissertation project. We will discuss your data before exploring plausible research designs. If you have elected to change topics from the literature review you prepared for G4010, let me know and begin researching other ideas so that you are prepared to move quickly through the semester.
Non-QMSS Concentration Classes
Prerequisites: ECON UN3211, ECON UN3213, ECON UN3412, and MATH UN2010. Students must register for the required discussion section. The linear regression model will be presented in matrix form and basic asymptotic theory will be introduced. The course will also introduce students to basic time series methods for forecasting and analyzing economic data. Students will be expected to apply the tools to real data.
Prerequisites: ECON UN3211, ECON UN3213, ECON UN3412, and MATH UN2010. Required discussion section ECON GU4214. An introduction to the dynamic models used in the study of modern macroeconomics. Applications of the models will include theoretical issues such as optimal lifetime consumption decisions and policy issues such as inflation targeting. This course is strongly recommended for students considering graduate work in economics.
DATA SCIENCE CONCENTRATION
Prerequisites: Calculus. This course covers the following topics: fundamentals of probability theory and statistical inference used in data science; probabilistic models, random variables, useful distributions, expectations, law of large numbers, central limit theorem; statistical inference: point and confidence interval estimation, hypothesis tests, linear regression.
Algorithms for Data Science
Prerequisites: basic knowledge of programming (e.g., at the level of COMS W1007) and a basic grounding in calculus and linear algebra. Methods for organizing data, e.g. hashing, trees, queues, lists, priority queues. Streaming algorithms for computing statistics on the data. Sorting and searching. Basic graph models and algorithms for searching, shortest paths, and matching. Dynamic programming. Linear and convex programming. Floating point arithmetic, stability of numerical algorithms, eigenvalues, singular values, PCA, gradient descent, stochastic gradient descent, and block coordinate descent. Conjugate gradient, Newton and quasi-Newton methods. Large scale applications from signal processing, collaborative filtering, recommendation systems, etc.