Journal of the American Academy of Dermatology
In press. Proof corrected by the author. Available online since Friday, 31 January 2020
DOI: 10.1016/j.jaad.2019.07.014
Accepted: 3 July 2019
Natural language processing of Reddit data to evaluate dermatology patient experiences and therapeutics

Edidiong Okon, BSE a, Vishnutheja Rachakonda, BS a, Hyo Jung Hong, BA b, Chris Callison-Burch, PhD a, Jules B. Lipoff, MD c
a School of Engineering, University of Pennsylvania, Philadelphia, Pennsylvania 
b Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania 
c Department of Dermatology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania 

Correspondence to: Jules B. Lipoff, MD, Penn Medicine University City, 3737 Market St, Ste 1100, Philadelphia, PA 19104.

There is a lack of research studying patient-generated data on Reddit, one of the world's most popular forums with active users interested in dermatology. Techniques within natural language processing, a field of artificial intelligence, can analyze large amounts of text information and extract insights.


To apply natural language processing to Reddit comments about dermatology topics and to assess its feasibility and its potential for insights and engagement.


A software pipeline preprocessed Reddit comments from 2005 to 2017 from 7 popular dermatology-related subforums on Reddit, applied latent Dirichlet allocation, and used spectral clustering to establish cohesive themes and the frequency of word representation and grouped terms within these topics.
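
The preprocessing and topic-modeling steps described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the abstract does not name the tooling, the preprocessing rules, or the inference method, so the stop-word list, the hyperparameters (`alpha`, `beta`, `n_iter`), the function names, and the choice of collapsed Gibbs sampling are all assumptions. The toy comments are made up for the demo and are not drawn from the study's corpus. The spectral clustering step is not shown.

```python
import random
import re
from collections import Counter

# Hypothetical stop-word list; the abstract does not describe the real one.
STOP_WORDS = {"the", "a", "an", "and", "or", "to", "of", "my", "i", "it",
              "is", "for", "on", "in", "with", "has", "this", "really"}


def preprocess(comment):
    """Lowercase a comment, keep alphabetic tokens, drop stop words."""
    tokens = re.findall(r"[a-z]+", comment.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def fit_lda(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling (an assumed inference method).

    docs is a list of token lists. Returns (topic_word, doc_topic):
    per-topic word counts and per-document topic counts.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    topic_word = [Counter() for _ in range(n_topics)]
    topic_total = [0] * n_topics
    doc_topic = [[0] * n_topics for _ in docs]

    # Start from a random topic assignment for every token.
    assign = []
    for d, doc in enumerate(docs):
        zs = [rng.randrange(n_topics) for _ in doc]
        for w, z in zip(doc, zs):
            topic_word[z][w] += 1
            topic_total[z] += 1
            doc_topic[d][z] += 1
        assign.append(zs)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assign[d][i]
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                doc_topic[d][z] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                topic_word[z][w] += 1
                topic_total[z] += 1
                doc_topic[d][z] += 1
    return topic_word, doc_topic


def top_words(topic_word, n=3):
    """Return the n most frequent words assigned to each topic."""
    return [[w for w, _ in counts.most_common(n)] for counts in topic_word]


if __name__ == "__main__":
    # Made-up comments for illustration only.
    comments = [
        "My eczema flares in the winter!",
        "Moisturizer really helps my eczema in winter",
        "Accutane dried my skin but cleared my acne",
        "Acne scars faded after accutane",
    ]
    docs = [preprocess(c) for c in comments]
    topic_word, _ = fit_lda(docs, n_topics=2)
    for k, words in enumerate(top_words(topic_word)):
        print(f"topic {k}: {words}")
```

Because latent Dirichlet allocation is unsupervised, the topics this sketch prints carry no labels; a reader has to inspect the top words of each topic and decide what theme, if any, they represent.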


We created a corpus of 176,000 comments and identified trends in patient engagement in spaces such as eczema and acne, among others, with a focus on homeopathic treatments and isotretinoin.


Latent Dirichlet allocation is an unsupervised model, meaning there is no ground truth to which the model output can be compared. However, because these forums are anonymous, there seems little incentive for patients to be dishonest.


Reddit data has viability and utility for dermatologic research and engagement with the public, especially for common dermatology topics such as tanning, acne, and psoriasis.


Key words: artificial intelligence, natural language processing, patient education, patient engagement, Reddit, social media

 Funding sources: None.
 Conflicts of interest: None declared.
 IRB approval: Exempt by University of Pennsylvania IRB.
 Reprints not available from the authors.

© 2019 American Academy of Dermatology, Inc.