PythonOER
An open resource for Python information, developed and curated as a resource related to healthcare, data science, and Python. Scope creep abounds. Suggestions welcome. Thanks for visiting.
Table of Contents
General References/Resources:
https://www.mastersindatascience.org/
List of lists, approachable review of job titles, programs that teach data, and courses online; owned by 2U (edx)
Great teaching and practice of fundamentals; decreases the barrier to entry and enables data-driven research
Pitt resource from Dr. Peter Brusilovsky
defunct as of 2022
Online Courses Related to Python:
Forbes Best Free online data science courses in 2019 - Link
Data Camp - Link
Python plus many other languages
Data Science Plus - Link
Impressive review of Stats Methods (Linear regression)
Kaggle - Kaggle.com
Data science community with full and short courses; public data sets; public examples of data projects; competitions. Great resource for introductory learners
SUNY OLI - Link
Open learning resources: Principles of Computation with Python, a course with content and quizzes that covers programming, data, encryption, and cellular automata. Free to view; a free account is required for full features
Artificial Intelligence in Healthcare, Stanford Certificate program - Link
Four-course series covering clinical data, fundamentals of ML for healthcare, evaluation of AI in healthcare, and a capstone
OurCodingClub - Python Tutorial Link
Course hosted on a GitHub site with easy-to-follow tutorials, originally meant for ecosciences
Learning Python: basic level Link
Kevin Dunn's fantastic review of concepts. From the site: The topics listed below give an idea of what is covered. Within each notebook are a series of simple or more challenging problems. The problems are designed to build on the topics just learned, as well as the topics from earlier notebooks.
Lab in Cognition and Perception online textbook by Todd Gureckis, Brenden M. Lake and others Link
While thematically based on cognition/perception, has a hands-on approach to Jupyter and Python. Great content for concepts like Linear Regression and Python resources
Coursera
YouTube Content
Google Cloud Platform - Link
Search of "Google Cloud Tech" for "python" yields lots of information about Python as a language, as a data tool, and as a tool in technical (computer-science styled) applications
Python Tutorial for Beginners - Link
User "Telusko"; 100+ lessons
Download/Install Python, Using Python, Getting started with Python, Variables, Functions, Object, List in Python and more by Navin Reddy
Videos run approximately 10-15 minutes each
Python Beginner Course - Link
Basics of syntax; some specific exercises like building a random color generator
“Data Professor” - Python search playlist - Link
Vast array and breadth/depth of topics in data science from a single user, including language-specific content, conceptual videos, and interviews with others in data science
"Healthcare Data Analytics" - Link
Tools provided as part of the "Health IT Curriculum Resources for Educators" from HealthIT.gov's Workforce Development program
Working with Medication Data - Link
Other Locations of Information:
OHSU - Wiki that reviews other resources - Main page, Article list
Mainly a review of clinical informatics (how a clinician informaticist thinks about systems and data)
“The Ultimate Data Science Prerequisite Learning List” - Link
List of lists
Broad definition and application examples of math concepts in data science and high level review of programming in data science
ML YouTube Courses - Link
Hosted on GitHub by DAIR.AI; includes a brief outline of each course below the full list
Best of ML with Python - Link
List of open source machine learning projects hosted on GitHub
Johns Hopkins Data Science Lab - https://jhudatascience.org/index.html
Great resource for MOOCs on R, open textbooks (e.g., Leadership in Cancer Informatics), and other resources.
Statistics Materials:
YouTube “StatQuest” - Josh Starmer - Link
Excellent review in breadth and depth of topics. Videos like "Gentle Introduction to Machine Learning" are immensely valuable to new learners.
UTHealth - Biostatistics for the Clinician - Link
YouTube - Brandon Foltz - Link
BMJ Statistics - Link
Kenyon - Biology - Link
Health Knowledge UK, Public Health Textbook, statistical methods section - Link
StatR Analysis - Which test to choose - Link
Medium, Towards Data Science - “Everything You Need To Know about Hypothesis Testing — Part II” - Link
Open UMich Introduction to Statistics - Link
Seeing Theory, online book - Link
Fantastic review of statistics in an interactive online format
Statology - Link
Good review of tests, great section in "software tutorials" to explain how to conduct tests in Excel, Google Sheets, Python, R, etc
Books:
Clinical Data as the Basic Staple of Health Learning: Creating and Protecting a Public Good: Workshop Summary - Link
Part of the "Learning Health System" series
Neural Data Science : A Primer with MATLAB® and Python - Link (Available via Pitt HSLS)
Python for Bioinformatics Link (Available via Pitt HSLS)
Python for Everybody - Link
Trinket Open book
Very easy to read
Hands-on exploratory data analysis with python : perform EDA techniques to understand, summarize, and investigate your data - Link (Pitt ULS)
Hands-On Machine Learning with Python and Scikit-Learn - Link (Available via Pitt HSLS, 2 hours of video)
Hands-On PySpark for Big Data Analysis - Link (Available via Pitt HSLS)
Become a Python Data Analyst: Perform Exploratory Data Analysis and Gain Insight into Scientific Computing Using Python - Link (Available via Pitt HSLS)
Learn Data Analysis with Python: Lessons in Coding - Link (Available via Pitt HSLS)
Hands-On Data Analysis with Pandas: Efficiently Perform Data Collection, Wrangling, Analysis, and Visualization Using Python - Link (Available via Pitt HSLS)
Python for data analysis : data wrangling with pandas, NumPy, and IPython - Link (Available via Pitt HSLS)
Hands-On Data Visualization: Interactive Storytelling from Spreadsheets to Code, by Jack Dougherty and Ilya Ilyankou - Open book available via GitHub - Link
Codeless Deep Learning with KNIME: Build, train, and deploy various deep neural network architectures using KNIME Analytics Platform - Link
Datasets:
Mendeley Data Sets - Link
Mendeley Data is a free and secure cloud-based communal repository where you can store your data, ensuring it is easy to share, access and cite, wherever you are.
Synthetic Cardiovascular Risk Dataset, Github hosted - Link
CVD risk data
National Sleep Research Resource, Sleep data sets - Link
Teaching Statistics in the Health Sciences - Link
A great repository that matches data sets (mostly CSV) with the papers that use them
UC Irvine Machine Learning Repository - Link
Popular reference for machine learning data sets
Berkeley Library, University of California, Health Statistics & Data: Datasets/Raw Data - Link
EMRBots - https://github.com/kartoun/emrbots
Experimental artificially generated electronic medical records (EMRs), Wiki articles
Harvard Dataverse, Med/Health/Life Science tag - Link
Synthetic EMR Data Set - https://synthea.mitre.org/Synthethic ; Mitre Mass
SyntheticMass contains realistic but fictional residents of the state of Massachusetts. The synthetic population aims to statistically mirror the real population in terms of demographics, disease burden, vaccinations, medical visits, and social determinants.
NIMH Data Archive - https://nda.nih.gov/
National Institute of Mental Health Data Archive (NDA) is a single infrastructure that was initially created through the integration of a set of research data repositories
Summary information on the data shared in NDA is available in the NDA Query Tool without the need for an NDA user account. To request access to record-level human subject data, you must submit a Data Access Request.
Integrated Public Use Microdata Series - Link
IPUMS International is dedicated to collecting and distributing census microdata from around the world.
Harmonized International Census Data for Social Science and Health Research
Clinical Practice Research Datalink (CPRD) - Link
Proprietary, Limited Access
Clinical Practice Research Datalink (CPRD) is a real-world research service supporting retrospective and prospective public health and clinical studies. CPRD research data services are delivered by the Medicines and Healthcare products Regulatory Agency with support from the National Institute for Health and Care Research (NIHR), as part of the Department of Health and Social Care. (United Kingdom)
Medical Imaging, Osteoarthritis Initiative (OAI) - https://nda.nih.gov/oai/
This website contains the permanent archive of the clinical data, patient reported outcomes, biospecimen analyses, quantitative image analyses, radiographs (X-Rays) and magnetic resonance images (MRIs) acquired during this study. There are longitudinal assessments and measurements from 4,796 subjects, with data from over 431,000 clinical and imaging visits, and almost 26,626,000 images in this archive. More than 400 research manuscripts have already been generated based on this data.
NHS, Business Services Authority, Prescribing data - Link
We provide prescribing, dispensing and organization data to help NHS stakeholders track trends and to inform decisions. Using a wide range of information based on prescribing and dispensing, we create reports specific to user needs and requirements.
Opendata NHS, Scotland - Link
Prescriptions in the Community
The Scottish Health and Social Care open data platform gives access to statistics and reference data for information and re-use.
AHRQ, Synthetic Healthcare Database for Research (SyH-DR) - Link
The Synthetic Healthcare Database for Research (SyH-DR) is an all-payer, nationally representative claims database. The database consists of a sample of inpatient, outpatient, and prescription drug claims, including utilization, payment, and enrollment data, for people insured by Medicare, Medicaid, or commercial health insurance in 2016. AHRQ created SyH-DR, in part, as a resource to facilitate improvements to price and quality transparency in healthcare.
Synapse - Data repository for publication data. Includes a subsection for digital health and biomarkers
FDA Github
Great starting point for learning about the variety of information available from the FDA via Github. Includes documentation for FDA APIs.
Mimi Labs - Data Catalog, great references from a great group
Sage Data Planet
Fun Reads / Videos:
Machine Learning in Medicine - Link
Virtual Seminar series from Carnegie Mellon University
O’Reilly Training - YouTube playlist Link
Bringing AI to the Underserved Billions - Link, TED Talk, 12 minutes
How to keep human bias out of AI - Link, TED Talk, 12 minutes
ONC Overview of HealthCare Data Analytics - Link 20 minutes
Data Science in Healthcare, PyData NYC 2018 - Link
BD2K - Exploratory Data Analysis - Link 60 minutes
University of Virginia, “Exploratory Data Analysis” - Link 20 minutes
Python for Industry Pharmaceuticals and Healthcare - Link (4 minutes)
Python vs R vs SAS, Simplilearn - Link (20 minutes)
Machine Learning Crash Course (Anaconda, 60 min) - Link
What is Machine Learning (Google Cloud Platform, 5 min) - Link
Machine learning without code in a browser (Google Cloud Platform, 10 min) - Link
All My Pharmacy Students Learn to Code - Link
Article by David Berkowitz about the role of learning skills for clinical students
Python: Go From Rookie To Rockstar, by Abhishek Verma, Nov-2021 - Link
Simple article that covers many basics of Python
Data Science Solutions for Digital Healthcare - Link
Collection of low/no-code workflows for the KNIME platform, e.g., vancomycin dosing in obesity, predicting patient glucose levels, and natural language processing for disease tagging in literature.
Data Literacy for the Busy Librarian - Link
10 videos from the National Library of Medicine (shout-out to Nancy Shin from UWashington) covering the data life cycle, documentation, standards, SNOMED, RxNorm, UMLS, data security, data sharing, and visualizations. Videos run 5-20 minutes, are easy to understand, and make great introductions.
An introduction to Python for R Users - Link
Blog post with a walkthrough of basic functions such as libraries.
Learning through Application / Cases:
Example applications or tests of knowledge/skills
Public vs. Private payer data sets Link
Comparing patient records in a sample data set from public vs. private payers
Considerations of medication and classification of opioids
Created by Dominic DiSanto as a part of University of Pittsburgh Office of the Provost Open Education Resources Grants
No Code Machine Learning, Google Creative Lab
"Experiments with Google" is a collection of open, accessible applications of machine learning, artificial intelligence, and development cycles related to various data.
Machine learning without code in the browser (Link YouTube, Google, 10 minutes) - Helpful overview/walkthrough of the website that correlates to steps in machine learning
"Experiments with Google" lab: Teachable Machine (Link) - Google Creative Lab, no coding required; this launches the webcam-based model (others include an audio-based model)
Tools:
Literature
Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) - https://www.tripod-statement.org/
TRIPOD Checklist
Method to report study results with a TRIPOD checklist
Literature:
Journals, Collections
Email/Updates
PLOS, arXiv, UConn Center for mHealth and Social Media (Mailing List)
Method to learn via updates and inspire further investigation
Nature Journal - Scientific Data library - Link
“Acquiring and Using Electronic Health Record Data” - Link
From the NIH Pragmatic Trials Collaboratory - RethinkingClinicalTrials
Articles
How to Read Articles That Use Machine Learning: Users’ Guides to the Medical Literature - Link
Helpful review of concepts related to machine learning in clinical contexts
A Machine Learning Approach for the Detection and Characterization of Illicit Drug Dealers on Instagram: Model Evaluation Study - Link - Python for language analysis
A validation of machine learning-based risk scores in the prehospital setting - Link
CHIME: COVID-19 Hospital Impact Model for Epidemics - Link
A machine learning approach predicts future risk to suicidal ideation from social media data - Link
“A dataset quantifying polypharmacy in the United States” - Link
“Coding Errors in Study of Meta-analyses With Falsified Data in the Results” - Link
“Developing a Natural Language Processing tool to identify perinatal self-harm in electronic healthcare records” - Link
“Artificial intelligence-assisted clinical decision support for childhood asthma management: A randomized clinical trial” - Link
Prescriptive analytics for reducing 30-day hospital readmissions after general surgery - Link
Data from national group, available if a member (UPMC)
Forecasting outbound student mobility: A machine learning approach - Link
Data from Taiwan
Use of social media big data as a novel HIV surveillance tool in South Africa - Link
Consider recreating the Python web-scraping code as an exercise (library: tweepy)
Forecasting seasonal influenza-like illness in South Korea after 2 and 30 weeks using Google Trends and influenza data from Argentina - Link
Interesting methods; would be able to recreate the analysis, but navigation is in Korean
Characterizing electronic health record usage patterns of inpatient medicine residents using event log data - Link
Has a GitHub link, but the original link in the article is not working; general case for use of Python
Deep neural network models for identifying incident dementia using claims and EHR datasets - Link
Very technical but relevant outline, Pitt has access to Optum
Subtypes in patients with opioid misuse: A prognostic enrichment strategy using electronic health record data in hospitalized patients - Link
Focus on latent class analysis (LCA); not much relevant code or ML information, but does discuss NLP
Predicting adverse drug reactions of combined medication from heterogeneous pharmacologic databases - Link
Available data, implemented SVM and k-NN using sklearn (with downloadable code). Would just require some re-organizing of materials
Performance and clinical utility of supervised machine-learning approaches in detecting familial hypercholesterolaemia in primary care - Link
Article with a link to R code on GitHub
Personalized prediction of early childhood asthma persistence: A machine learning approach - Link
GitHub link to Python code; the article gives some pseudocode and some sample CSV data files
Machine-learning-based prediction models for high-need high-cost patients using nationwide clinical and claims data - Link
UCLA+Japan, No data/code in the article
Comparing machine learning algorithms for predicting ICU admission and mortality in COVID-19 - Link
Great figure explaining the process
Teaching data science fundamentals through realistic synthetic clinical cardiovascular data - Link
Interesting review of how to teach data science
Applications of machine learning to undifferentiated chest pain in the emergency department: A systematic review - Link
Relevant conclusion - Machine learning can be better than clinicians but is rarely incorporated into practice
Beyond performance metrics: modeling outcomes and cost for clinical machine learning - Link
Editorial
Clinical impact and quality of randomized controlled trials involving interventions evaluating artificial intelligence prediction tools: a systematic review - Link
Using machine learning to study the effect of medication adherence in Opioid Use Disorder - Link
Comparison of machine learning approaches (XGBoost, Logistic Regression) for outcomes
Adverse drug event detection using natural language processing: A scoping review of supervised learning methods - Link
Discussion of how NLP concepts apply to detection in clinical notes/narratives
Long-term Effect of Machine Learning-Triggered Behavioral Nudges on Serious Illness Conversations and End-of-Life Outcomes Among Patients With Cancer: A Randomized Clinical Trial - Link
Interesting application of machine learning for behavioral nudges of prompting serious illness conversations
Predicting physician departure with machine learning on EHR use patterns: A longitudinal cohort from a large multi-specialty ambulatory practice Link
Uses XGBoost for prediction, with Shapley Additive Explanations (SHAP) used for feature contribution. The neat component is the granular detail (though without the data) from the cleaning process available on GitHub
Machine learning to improve frequent emergency department use prediction: a retrospective cohort study Link
Tested models against each other: Gradient boosting machines (GBM); Naïve Bayes (NB); Neural networks (NN); Random forests (RF). Similar to other studies, no model clearly outperformed the others.
Natural Language Processing for Automated Quantification of Brain Metastases Reported in Free-Text Radiology Reports Link
Review of a natural language processing approach to free-text radiology reports; the GitHub repo has a great sample file that describes the search process (no data available though), iteration (a great example of how to run multiple tests), and viewing results.
Predictive models in emergency medicine and their missing data strategies: a systematic review Link
Good review for missing data in healthcare, good figure for comparing dropping vs. Mean/Mode vs. Imputation
Machine learning in medicine: Performance calculation of dementia prediction by support vector machines (SVM) Link
Application of SVM for predicting Dementia, data available via Mendeley Data for 149 observations (non-binary outcomes: Non-Demented, Demented, Converted)
Classification of lapses in smokers attempting to stop: A supervised machine learning approach using data from a popular smoking cessation smartphone app Link
Okay article for reviewing use of classification across Random Forest, Support Vector Machine, Penalized Logistic Regression, and Extreme Gradient Boosting.
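The missing-data review listed above ("Predictive models in emergency medicine and their missing data strategies") compares dropping records against mean/mode replacement and imputation. As a rough illustration of the first two strategies only, here is a minimal sketch using the Python standard library. The toy records, column names, and helper functions are hypothetical and are not taken from the article; model-based imputation is not shown.

```python
from statistics import mean, mode

# Hypothetical toy records (not from any listed data set); None marks a missing value.
records = [
    {"age": 34, "sex": "F"},
    {"age": None, "sex": "M"},
    {"age": 51, "sex": None},
    {"age": 47, "sex": "F"},
]


def drop_incomplete(rows):
    """Complete-case approach: keep only rows with no missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]


def fill_simple(rows, numeric=("age",), categorical=("sex",)):
    """Fill missing numeric values with the column mean and
    missing categorical values with the column mode."""
    fills = {}
    for col in numeric:
        fills[col] = mean([r[col] for r in rows if r[col] is not None])
    for col in categorical:
        fills[col] = mode([r[col] for r in rows if r[col] is not None])
    # Build new rows so the original records are left unchanged.
    return [{k: (fills[k] if v is None else v) for k, v in r.items()} for r in rows]


complete = drop_incomplete(records)
filled = fill_simple(records)

print(len(complete), "of", len(records), "rows remain after dropping")
for row in filled:
    print(row)
```

Dropping shrinks the sample, while mean/mode filling keeps every row at the cost of reducing variability in the filled columns; which trade-off is acceptable depends on the data and the study question.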
Edit Notes - This is a living list with updates and edits, last updated: October 2024