-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathgenomePlotBasic.R
356 lines (257 loc) · 12.4 KB
/
genomePlotBasic.R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
# genomePlotBasic.R
#
# Purpose: Demo a genome plot of genes on a line.
#
#
# Version: 1.0
# Date: 2018 03 18
# Author: Boris Steipe (boris.steipe@utoronto.ca)
#
# Dependencies:
# readr package
#
# License: GPL-3 (https://www.gnu.org/licenses/gpl-3.0.en.html)
#
# Version history:
# 1.0 Final version for Biohacks 2018
# 0.3 Added an introduction section and more comments
# 0.2 Updated gene data
# 0.1 Derived from genomePlotDemo.R
#
# ToDo:
# - ...
#
# ==============================================================================
#TOC> ==========================================================================
#TOC>
#TOC> Section Title Line
#TOC> ------------------------------------------------------
#TOC> 1 INTRODUCTION 47
#TOC> 2 PARAMETERS 100
#TOC> 3 PACKAGES AND FUNCTIONS 125
#TOC> 4 PROCESS 134
#TOC> 4.1 READ SOURCE DATA 139
#TOC> 4.2 INITIALIZE DATA STRUCTURES 147
#TOC> 4.3 ANNOTATE 165
#TOC> 4.4 LAYOUT 172
#TOC> 4.5 PLOT 286
#TOC> 4.5.1 Compute scale and translation 289
#TOC> 4.5.2 Write SVG header 316
#TOC> 4.5.3 Render all elements 321
#TOC> 4.5.4 Write SVG footer 332
#TOC> 5 FINISH 336
#TOC>
#TOC> ==========================================================================
# = 1 INTRODUCTION ========================================================
# This script demonstrates a data-driven visualization of genome scale data. It
# is a very basic example, but written in a modular way that should make it easy
# to extend into something more interesting. Extensions of the code are provided
# in genomePlotDemo.R and genomePlotIntermediate.R (There is no "advanced"
# version - that would be your code).
# The code proceeds through five steps:
# 1 - Read the source data:
# A single data file containing gene annotations is read into a
# data frame. Extensions would read any additional kind of data
# that contributes to the final image at this point.
#
# 2 - Initialize data structures:
# In this step, information objects are defined. In this demo we define
# only gene information, and we store it in a data frame, one row per
# gene. Extensions would add non-gene information like general
# phentotypes, categories of ethnicity, geaographical distribution,
# a map of subcellular locations, tissue diagrams, explanatory text
# and legends, network data - whatever the visualization should include.
#
# 3 - Annotate:
# In this demo, no further annotation is performed. Extensions would
# correlate, compile and contrast data, to populate attributes of our
# data structures with values.
#
# 4 - Layout:
# In this step data is mapped to visuals. We call these "shapes", and
# we build them from the items in our data structures. In this
# script we define a single line to represent Chr 20, add a text label,
# and plot each gene as a coloured rectangle on the line, where
# position represents the chromosomal location. This step implements
# the main vision about how the data should be visualized and we
# expect that it will absorb the majority of your effort. Extensions
# would include arranging shapes in a variety of different basic
# layouts, adding lines and curves that connect shapes to symbolize
# processes and relationships, implementing masks, and layers,
# optimizing placement of shapes with algorithms such as
# force-directed layout, or hierarchical trees,
# transforming sections of the layout with local deformations such as
# perspective and hyperbolic lenses and much, much more.
#
# 5 - Plot:
# In this step the shapes that have been defined previously are
# rendered to SVG markup that can be saved and displayed. A basic
# set of functions have been provided for this and we expect that
# any extensions you need are easy to add. Speak to the mentors for
# advice.
# = 2 PARAMETERS ==========================================================
# Code should not contain "magic numbers". Constants that we need are
# defined and commented here.
CHR20LENGTH <- 64444167 # basepairs of the chromosome we are working with
DATAFILE <- "Chr20GeneData.tsv" # Chromosome data input. See README-DATA and
# prepareGenomeData.R for a description of the
# contents, and the code that produces it from
# database sources.
NACOLOUR <- "#AAAAAA" # Neutral grey for NA attributes
SVGFILE <- "test.svg" # Filename for the output we produce
# UTPoster prints from 24" x 36" all the way to 60" x 300".
# Let's assume legal paper size for this demo, landscape orientation, and
# subtract a 1" margin on both sides.
PAGEWIDTH <- ( 14.0 - 2) * 2.54 # in cm
PAGEHEIGHT <- ( 8.5 - 2) * 2.54 # in cm
RESOLUTION <- 150 # pixels per 2.54 cm
# = 3 PACKAGES AND FUNCTIONS ==============================================
#
# All required packages and functions are loaded from the source file below.
# You can inspect/copy/modify the source code there.
source("genomePlotFunctions.R")
# = 4 PROCESS =============================================================
# This demo code will plot genes as rectangles on a line and color the boxes.
# == 4.1 READ SOURCE DATA ==================================================
# read_tsv() is from the readr package. It is similar to base R's read.delim()
# function, but more modern.
myData <- read_tsv(DATAFILE)
# == 4.2 INITIALIZE DATA STRUCTURES ========================================
# There are many possibilities to store the data for the objects we will analyze
# and draw. Here we take a very simple approach and store gene-level data in one
# data frame.
# Entity annotations: data for each gene
myGenes <- data.frame(sym = myData$sym, # Gene symbols
start = myData$start, # start
end = myData$end, # end
strand = myData$strand, # strand
GOid = myData$GO_P, # GO annotation for "Process"
stringsAsFactors = FALSE)
# == 4.3 ANNOTATE ==========================================================
# (No computed annotations in this basic demo.)
# == 4.4 LAYOUT ============================================================
# The layout phase is where we turn data into visuals.
# Since we need to accommodate quite different types of objects, we will collect
# them in a list. Each element is itself a list that describes the
# object with all the detail we need so we can draw it out later.
myShapes <- list()
# We will call the things that we are going to draw "shapes".
# To draw our shapes, we need to define:
# - what they are
# - where they are going to be drawn
# - how large
# - with what stroke-width
# - with what colour
# - with what fill
# - ... Many other attributes can be added - perspective, labels,
# curvature, line type, shadow, gradient etc. etc.
# Let's start with some generic elements to structure our plot. We will draw the
# chromosome backbone as a line and we will add the chromosome name as a text
# element.
# At first, we are only concerned with relative positions and we will layout
# shapes into an arbitrary canvas. Later we will map this into page coordinates.
# The chromosome line will be placed on the X axis, start at (0, 0) and run
# to (1, 0)
# Chromosome backbone:
CHR20FROM <- c(0.0, 0.0)
CHR20TO <- c(1.0, 0.0)
myShapes[[1]] <- list(type = "line",
p1 = CHR20FROM,
p2 = CHR20TO,
stroke = "#4499AA", # colour of line
sw = 7.0) # stroke-width
# Next we add some descriptive text:
myShapes[[2]] <- list(type = "text",
text = "CHR 20",
centre = c(0.1, 0.1),
size = 48, # points
font = "Times",
fill = "#33AAFF")
# Next we add the genes. We will draw them as rectangles. genes on the (+)
# strand will go above the circle, genes on the (-) strand will go below the
# circle. We will place them on the circle at their fractional position on the
# chromosome. We will also give them colour:
# Colour:
# We are providing a function category2colour() to make life simple. It
# takes a vector of items, and returns a named vector of corresponding
# color values.
#
# For this demo we will color the genes by their function. Thus we define
# a basic, divergent spectrum, and map these colors to GO ids.
mySpect <- c("#f2003c", # red
"#F0A200", # orange
"#f0ea00", # yellow
"#62C923", # green
"#0A9A9B", # blue
"#1958C3", # indigo
"#8000D3", # violet
"#D0007F") # red
myGOcolours <- category2colour(sort(unique(myGenes$GOid)),
col = mySpect)
# Add each gene to the list:
for (i in 1:nrow(myGenes)) {
# The centre of the rectangle is placed above or below the line, depending
# on the strand. Calculate height first:
thisHeight <- 0.01 * (CHR20TO[1] - CHR20FROM[1]) # 1% of chromosome length
thisCentre <- c(mean(c(myGenes$start[i], myGenes$end[i])) / CHR20LENGTH, # x
0.0 + (myGenes$strand[i] * thisHeight * 0.5) ) # y
thisWidth <- abs(myGenes$start[i] - myGenes$end[i]) / CHR20LENGTH
# We use the colour for GO anotations we defined above.
myFill <- myGOcolours[myGenes$GOid[i]]
if (is.na(myFill)) {
myFill <- NACOLOUR
}
# At this scale, a typical gene is about a hair's width. We draw the outline
# of the rectangle, with a thin line, to give it a minimum width for
# visibility.
# Define a rectangle shape with these parameters:
myShapes[[length(myShapes) + 1]] <- list(type = "rect",
centre = thisCentre,
w = thisWidth,
h = thisHeight,
ang = 0,
fill = myFill,
stroke = myFill,
sw = 0.5) # points
}
# Done. All shapes are defined.
# == 4.5 PLOT ==============================================================
# cf. https://www.w3.org/TR/SVG
# === 4.5.1 Compute scale and translation
# Caution: the SVG coordinate system has its origin (0, 0) in the TOP LEFT
# corner, positive X goes right, and positive Y goes down. Here we define the
# necessary scaling and translation.
# range of coordinates:
dX <- (CHR20TO[1] - CHR20FROM[1]) * 1.1
# Given the range that needs to fit on the page, we can compute the scale:
sXY <- RESOLUTION * (PAGEWIDTH / 2.54) / dX
# We compute the dimensions of the page in pixels ...
Xpx <- RESOLUTION * (PAGEWIDTH / 2.54)
Ypx <- RESOLUTION * (PAGEHEIGHT / 2.54)
# And we compute a translation: for this demo, we move the chromosome
# to the middle of the page, and 1% to the right:
tXY <- c(0.01 * (CHR20TO[1] - CHR20FROM[1]),
Ypx / 2)
# === 4.5.2 Write SVG header
mySVG <- SVGheader()
mySVG <- c(mySVG, SVGdefinePage(Xpx, Ypx))
# === 4.5.3 Render all elements
#
for (i in 1:length(myShapes)) {
mySVG <- c(mySVG, SVGrenderElement(myShapes[[i]],
sc = sXY,
tr = tXY,
Y = Ypx))
}
# === 4.5.4 Write SVG footer
mySVG <- c(mySVG, SVGfooter())
# = 5 FINISH ==============================================================
# Write the SVG to file
writeLines(mySVG, con = SVGFILE)
# Open the SVG in the default browser to visualize
system(sprintf("open -a \"Google Chrome\" %s", SVGFILE)) # For MacOS
# Windows ???
# Linux ???
# ==== TESTS =================================================================
# ...
# [END]