Introduction to Python in Clinical Trials

A step towards AI/ML in Clinical Trials

We are in the last quarter of 2021, and even today, when someone brings up clinical trial data analysis, CDISC SDTM mapping, or programmatic QC of outputs and datasets, nearly all programmers will think of using SAS to do the job.

This de facto choice often results in Clinical Data Programmers or Statistical Programmers identifying themselves as "SAS Programmers". There are various reasons for SAS's dominance, and a few are -

  • Inbuilt features that match regulatory agency requirements
  • Industry dominance i.e. everyone (almost) uses SAS
  • Thoroughly tested
  • Good support from industry experts as well as from SAS

There are a few problems with this though. Let's go through a few -

  1. Illusion of not having a choice - The strong correlation between SAS and clinical data gives SAS Programmers the false impression that SAS is the only choice available to them for any data-related task. SAS is certainly not the only tool available. In some companies, tools like I-Review, J-Review, Spotfire and S-PLUS are used for specific needs and exploratory analysis. However, these tools are often not readily available either: they are sold under commercial licenses, need complicated installations, and require IT admins to approve installation requests.

  2. Less accessible - SAS is licensed software and is not readily available to install and try out on personal machines, or even on office computers, for experimentation. You can of course use the educational version, but it has quite a few constraints, especially on the number of dataset observations it can process. The installation too is a bit hectic, as one needs to go via a virtual machine. Not sure what I am talking about? That's precisely the issue. In larger organisations SAS is usually available only on a server where uploading anything is not permitted, let alone experimenting with data. This limitation often creates a very strong association between SAS, data analysis and the office, and programmers are often caught in it for years.

  3. Cost - We may not know the exact cost of a SAS license, but we do know that it is very costly. Although programmers who use SAS in their jobs don't have to worry about the cost of the license, it is important to note that this cost does affect the overall cost of clinical research. At the end of the day, someone has to pay for it. It is practically impossible for an individual programmer to purchase a SAS license and start experimenting.

  4. SAS is just a programming language - That’s right, SAS is just one of the many programming languages available to programmers to analyse, organise and review data. It’s certainly not the only tool or language for statistical programming or CDISC SDTM mapping.

  5. Limited support for modern data structures - If you want to read from Excel or a SQL database and you have the licensed SAS module for it (SAS/ACCESS), you are in luck. If you happen to download data from the internet in XML / JSON / ORC formats, then you cannot read it in without workarounds and botches. In summary, any tabular data is fine, but nested data or data with its own markup will be a problem.
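To make point 5 concrete from the other side, nested data of this kind is fairly easy to flatten in Python. The sketch below uses pandas' json_normalize on a made-up payload. The field names (subject_id, site, visits, weight_kg) are invented for illustration and do not come from any real study or standard.

```python
import json

import pandas as pd

# Hypothetical nested payload: one record per subject, each with a nested
# site object and a list of visits. All names and values are made up.
raw = """
[
  {"subject_id": "001", "site": {"id": "S01", "country": "IN"},
   "visits": [{"visit": "SCREENING", "weight_kg": 71.5},
              {"visit": "WEEK 4", "weight_kg": 70.8}]},
  {"subject_id": "002", "site": {"id": "S02", "country": "GB"},
   "visits": [{"visit": "SCREENING", "weight_kg": 64.0}]}
]
"""

records = json.loads(raw)

# Produce one row per visit. Subject-level fields and the nested site fields
# are repeated on every visit row (nested keys become "site.id", "site.country").
visits = pd.json_normalize(
    records,
    record_path="visits",
    meta=["subject_id", ["site", "id"], ["site", "country"]],
)

print(visits)
```

The result is an ordinary tabular DataFrame, so it can be filtered, merged and summarised like any other dataset.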

How Python or R can help solve the above problems

Python and R have been around for quite a few years and have been used sparingly in Clinical Trials by small and large sponsor companies and CROs, but probably never as first-class citizens. They have mostly been used for supplementary analysis or for creating visualisations. Statisticians have also been using the R language and RStudio for exploratory analysis or real-world studies.


A few advantages of using Python in this context are as below -

  • It is an alternative - Python as a programming language offers an alternative approach to analysis and data mapping in Clinical Trials. Data management and Statistical programming teams can very well use Python for creating edit checks, SDTM mapping, and double programming / QC of datasets and outputs.

  • Free / No licensing - Python is open source and there is no license fee for using it. Many Python packages, such as pandas, numpy, scikit-learn and matplotlib, are also free to use.

import pandas as pd

  • Highly accessible - Assuming you have installation rights on your laptop/workspace, installing Python is just like installing any other software. Some users go via Anaconda, but that's optional. Since the software is free and open source, IT administrators are usually willing to install it on your laptop if you request it. You are suddenly no longer expected to work only on some server or office machine; you can experiment with data analytics even on your home computer.

  • Multiple & mature packages - Python has many mature packages available to use. For example, you can use pandas and numpy for data crunching and analysis, and requests to interact with Web APIs. For almost any use case, you will find a package. I have used pandas and numpy extensively for CDISC SDTM mapping and for writing checks on raw and SDTM data. To install a package, just run the command below and you are done. You usually don't need administrator access for this.

c:\projects > pip install pandas

  • Can read any data (even SAS/XPT datasets) - There are packages that can help you read data in almost any form, including SAS datasets (SAS7BDAT) and XPT files. Python can also very easily read XML, JSON, ORC, Excel, CSV and TSV files. This is not a complete list, but you should be able to comfortably read any data format you need.

  • Community support - Because Python is used extensively in data science, artificial intelligence and machine learning, you will find a ton of resources online and on Stack Overflow, with questions around data crunching and analysis, summarisation, aggregation, reporting, and visualisation. It's literally just a search away. As of today, Stack Overflow has about 1.8 million questions/articles related to Python, which is not a small number. Comparatively, there are only about 40k questions on SAS.
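As a rough illustration of the edit checks and double programming / QC mentioned in the list above, here is a minimal pandas sketch. The datasets, column names (SUBJID, VSDAT, SYSBP, USUBJID, AGE) and the 60-250 mmHg plausibility range are all assumptions made up for this example, not taken from a real study or specification.

```python
import pandas as pd

# Hypothetical raw vital-signs data; names and values are illustrative only.
raw = pd.DataFrame({
    "SUBJID": ["001", "002", "003"],
    "VSDAT": ["2021-10-01", "2021-10-02", None],
    "SYSBP": [120, 310, 118],
})

# Edit check: flag records with a missing assessment date or a systolic BP
# outside an assumed plausible range of 60-250 mmHg.
issues = raw[raw["VSDAT"].isna() | ~raw["SYSBP"].between(60, 250)]
print(issues)

# Double programming / QC: compare a production dataset against an
# independently programmed QC dataset, keyed on subject identifier.
prod = pd.DataFrame({"USUBJID": ["S-001", "S-002"], "AGE": [34, 51]})
qc = pd.DataFrame({"USUBJID": ["S-001", "S-002"], "AGE": [34, 15]})

# DataFrame.compare (pandas 1.1+) keeps only the cells that differ and
# labels the two sides "self" (prod) and "other" (qc).
diff = prod.set_index("USUBJID").compare(qc.set_index("USUBJID"))
print(diff)
```

With real data, the frames would typically come from files rather than being typed in; for example, pandas provides read_sas for SAS7BDAT and XPT files. Note that compare expects both frames to have identical row and column labels, so a real QC step would likely check for missing or extra records first.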




The above points are based on my experience of working with SAS for about two decades and with Python/R for the last couple of years. The idea behind this article is to highlight a viable alternative to the default approach (SAS) taken today by almost all SAS Programmers working in the Clinical Trials area.

Hope you liked the article. Please leave your comments, feedback, questions and suggestions in the comments below.

Did you find this article valuable?

Support Allwyn Dsouza by becoming a sponsor. Any amount is appreciated!