-
Notifications
You must be signed in to change notification settings - Fork 24
/
Copy pathdata_objects.qmd
754 lines (499 loc) · 26.7 KB
/
data_objects.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
---
output: html_document
editor_options:
chunk_output_type: console
---
# Data Object Type and Structure
```{r echo=FALSE}
source("libs/Common.R")
```
```{r echo = FALSE}
R_ver()
```
## Core data types
An R object's data type, or **mode**, defines how its values are stored in the computer. You can get an object's mode using the `typeof()` function. Note that R also has a built-in `mode()` function that will serve the same purpose with the one exception in that it will not distinguish integers from doubles.
### Numeric
The **numeric** data type is probably the simplest. It consists of numbers such as integers (e.g. `1 ,-3 ,33 ,0`) and doubles (e.g. `0.3, 12.4, -0.04, 1.0`). For example, to create a numeric (double) vector we can type:
```{r}
x <- c(1.0, -3.4, 2, 140.1)
mode(x)
```
To assess if the number is stored as an *integer* or a *double* use the `typeof()` function.
```{r}
typeof(x)
```
Note that removing the fractional part of a number when creating a numeric object does not necessarily create an integer. For example, creating what seems to be an integer object returns *double* when queried by `typeof()`:
```{r}
x <- 4
typeof(x)
```
To force R to recognize a value as an integer add an upper case `L` to the number.
```{r}
x <- 4L
typeof(x)
```
### Character
The **character** data type consists of letters or words such as `"a", "f", "project", "house value"`.
```{r}
x <- c("a", "f", "project", "house value")
typeof(x)
```
Characters can also consist of numbers represented as characters. The distinction between a character representation of a number and a numeric one is important. For example, if we have two numeric vectors `x` and `y` such as
```{r}
x <- 3
y <- 5.3
```
and we choose to sum the two variables, we get:
```{r}
x + y
```
If we repeat these same steps but instead, choose to represent the numbers `3` and `5.3` as characters we get the following error message:
```{r error=TRUE}
x <- "3"
y <- "5.3"
x + y
```
Note the use of quotes to force numbers to character mode.
### Logical
**Logical** values can take on one of two values: `TRUE` or `FALSE`. These can also be represented as `1` or `0`. For example, to create a logical vector of 4 elements, you can type
```{r}
x <- c(TRUE, FALSE, FALSE, TRUE)
```
or
```{r}
x <- as.logical(c(1,0,0,1))
```
Note that in both cases, `typeof(x)` returns `r typeof(x)`. Also note that the `1`'s and `0`'s in the last example are converted to `TRUE`'s and `FALSE`'s internally.
## Naming R objects
You can use any combination of alphanumeric characters--including dots and underscores--to name an R object. But there are a few exceptions:
* Names cannot start with a number;
* Names cannot have spaces;
* Names cannot be a standalone number such as `12` or `0.34`;
* Names cannot be a reserved word such as `if`, `else`, `function`, `TRUE`, `FALSE` and `NULL` just to name a few (to see the full list of reserved words, type `?Reserved`).
Examples of valid names include `a`, `dat2`, `cpi_index`, `.tmp`, and even a standalone dot `.` (though a dot can make reading code difficult under certain circumstances).
Examples of invalid names include `1dat`, `dat 2` (note the space between `dat` and `2`), `df-ver2` (the dash is treated as a mathematical operator), and `Inf` (the latter is a reserved word listed in the `?Reserved` help document).
You *can* mix cases, but use upper cases with caution since some letters look very much the same in both lower and upper cases (e.g. `s` and `S`).
## Derived data types
We've learned that data are stored as either numeric, character or logical, but they can carry additional *attribute* information that allow these objects to be treated in special ways by certain functions in R. These attributes define an object's **class** and can be extracted from that object via the `class()` function.
### Factor {#factors_xtr}
**Factors** are normally used to group variables into a fixed number of unique categories or **levels**. For example, a dataset may be grouped by gender or month of the year. Such data are usually loaded into R as a numeric or character data type requiring that they be converted to a factor using the `as.factor()` function.
In the following chunk of code, we create a factor from a character object.
```{r}
a <- c("M", "F", "F", "U", "F", "M", "M", "M", "F", "U")
a.fact <- as.factor(a)
```
Note that `a` is of `character` data type.
```{r}
typeof(a)
```
However, the derived object `a.fact` is now stored as an `integer`!
```{r}
typeof(a.fact)
```
Yet, when displaying the contents of `a.fact` we see character values.
```{r}
a.fact
```
How can this be? Well, `a.fact` is a more complicated object than the simple objects created thus far in that the factor is storing additional information not seen in its output. This hidden information is stored as an object **attribute**. To view these hidden attributes, use the `attributes` function.
```{r}
attributes(a.fact)
```
There are two attributes: `levels` and `class`. The `levels` attribute lists the three unique values in `a.fact`. The order in which these levels are listed reflect their *numeric* representation. So in essence, `a.fact` is storing each value as an integer that points to one of the three unique levels.
![](img/levels1.png)
So why doesn't R output the integer values when we output `a.fact`? To understand why, we first need to know that when we call the object name, R is passing that object name to the `print` function, so the following lines of code are identical.
```{r eval = FALSE}
a.fact
print(a.fact)
```
The `print` function then looks for a `class` attribute in the object. The class type instructs the `print` function on how to generate the output. Since `a.fact` has a `factor` class attribute, the `print` function is instructed to replace the integer values with the level "tags".
Naturally, this all happens behind the scenes without user intervention.
Another way to determine `a.fact`'s class type is to call the `class` function.
```{r}
class(a.fact)
```
The unique levels of a factor, and the order in which they are stored can be extracted using the `levels` function.
```{r}
levels(a.fact)
```
Remember, the order in which the levels are displayed match their integer representation.
Note that if a class attribute is not present, the `class` function will return the object's data type (though it will not distinguish between integer and double).
```{r}
class(a)
```
To appreciate the benefits of a factor, we'll first create a dataframe (dataframes are data tables whose structure will be covered later in this tutorial). One column will be assigned the `a.fact` factor and another will be assigned some random numeric values.
```{r}
x <- c(166, 47, 61, 148, 62, 123, 232, 98, 93, 110)
dat <- data.frame(x = x, gender = a.fact)
dat
```
The `gender` column is now a factor with three levels: `F`, `M` and `U`. We can use the `str()` function to view the dataframe's structure as well as its columns classes.
```{r}
str(dat)
```
Many functions other than `print` will recognize factor data types and will allow you to split the output into groups defined by the factor's unique levels. For example, to create three boxplots of the value `x`, one for each gender group `F`, `M` and `U`, type the following:
```{r fig.width = 4, fig.height=2.5,echo=2}
OP <- par( mar=c(2,2,0,0))
boxplot(x ~ gender, dat, horizontal = TRUE)
par(OP)
```
The tilde `~` operator is used in the plot function to split (or *condition*) the data into separate plots based on the factor `gender`.
Factors will prove to be quite useful in many analytical and graphical procedures as we'll see in subsequent sections.
#### Rearranging level order
A factor will define a hierarchy for its levels. When we invoked the `levels` function in the last example, you may have noted that the levels output were ordered `F`, `M` and`U`--this is the level hierarchy defined for `gender` (i.e. `F`>`M`>`U` ). This means that regardless of the order in which the factors appear in a table, anytime a plot or operation is conditioned by the factor, the grouped elements will appear in the order defined by the levels' hierarchy. When we created the boxplot from our `dat` object, the plotting function ordered the boxplot (bottom to top) following `gender`'s level hierarchy (i.e. `F` first, then `M`, then `U`).
If we wanted the boxplots to be plotted in a different order (i.e. `U` first followed by `F` then `M`) we would need to modify the `gender` column by recreating the factor object as follows:
```{r}
dat$gender <- factor(dat$gender, levels=c("U","F","M"))
str(dat)
```
The `factor` function is given the original factor values (`dat$gender`) but is also given the levels in the new order in which they are to appear(`levels=c("U","F","M")`). Now, if we recreate the boxplot, the plot order (plotted from bottom to top) will reflect the new level hierarchy.
```{r fig.width = 4, fig.height=2.5,echo=2}
OP <- par( mar=c(2,2,0,0))
boxplot(x ~ gender, dat, horizontal = TRUE)
par(OP)
```
#### Subsetting table by level and dropping levels
In this example, we can subset the table by level using the subset function. For example, to subset the values associated with `F` and `M`, type:
```{r}
dat.f <- subset(dat, gender == "F" | gender == "M")
dat.f
```
The double equality sign `==` differs from the single equality sign `=` in that the former asses a condition: it checks if the variable to the left of `==` equals the variable to the right.
However, if you display the levels associated with this new dataframe, you'll still see the level `U` even though it no longer exists in the `gender` column.
```{r}
levels(dat.f$gender)
```
This can be a nuisance when plotting the data subset.
```{r fig.width = 4, fig.height=2.5,echo=2}
OP <- par( mar=c(2,2,0,0))
boxplot(x ~ gender, dat.f, horizontal = TRUE)
par(OP)
```
Even though no records are available for `U`, the plot function allocates a slot for that level. To resolve this, we can use the `droplevels` function to remove all unused levels.
```{r}
dat.f$gender <- droplevels(dat.f$gender)
levels(dat.f$gender)
```
```{r fig.width = 4, fig.height=2,echo=2}
OP <- par( mar=c(2,2,0,0))
boxplot(x ~ gender, dat.f, horizontal = TRUE)
par(OP)
```
### Date
Date values are stored as numbers. But to be properly interpreted as a date object in R, their class attribute must be explicitly defined as a **date**. R provides many facilities to convert and manipulate dates and times, but a package called `lubridate` makes working with dates/times much easier. A separate [chapter](date_objects.html) is dedicated to the creation and manipulation of date objects.
### NA and NULL
You will find that many data files contain missing or unknown values. It may be tempting to assign these missing or unknown values a `0` but doing so can lead to many undesirable results when analyzing the data. R has two placeholders for such elements: `NA` and `NULL`.
For example, let's say that we made four measurements where the second measurement was not available but we wanted that missing value to be recorded in our table, we would encode that missing value as follows:
```{r}
x <- c(23, NA, 1.2, 5)
```
`NA` (Not Available) is a missing value indicator. It suggests that *a* value should be present but is unknown.
The `NULL` object also represents missing values but its interpretation is slightly different in that it suggests that the value does not exist or that it's not measurable.
```{r}
y <- c(23, NULL, 1.2, 5)
```
The difference between `NA` and `NULL` may seem subtle, but their interpretation in some functions can lead to different outcomes. For example, when computing the mean of `x`, R returns an `NA` value:
```{r}
mean(x)
```
This serves as a check to remind the user that one of the elements is missing. This can be overcome by setting the `na.rm` parameter to `TRUE` as in `mean(x, na.rm=TRUE)` in which case R ignores the missing value.
A `NULL` object, on the other hand, is treated differently. Since `NULL` implies that a value *should not* be present, R no longer feels the need to treat such element as questionable and allows the mean value to be computed:
```{r}
mean(y)
```
It's more common to find data tables with missing elements populated with `NA`'s than `NULL`'s so unless you have a specific reason to use `NULL` as a missing value placeholder, use `NA` instead.
#### NA data types
`NA` has different data types. By default, it's a logical variable.
```{r}
typeof(NA)
```
But it can be coerced to other types/classes such as character.
```{r}
typeof( as.character(NA))
```
It can also be coerced to a derived data type such as a date.
```{r}
class( as.Date(NA))
```
Note the use of `class` instead of `typeof` (recall that a date object is stored as a number but has a `Date` class attribute).
Alternatively, you can make use of built-in reserved words such as `NA_character_` and `NA_integer_`. Note that there is no reserved word for an `NA` date type.
```{r}
typeof(NA_character_)
typeof(NA_integer_)
```
The distinction between `NA` types can be important in certain settings. Examples of these will be highlighted in [section 9.3](dplyr.html#conditional-statements).
## Data structures
Most datasets we work with consist of batches of values such as a table of temperature values or a list of survey results. These batches are stored in R in one of several *data structures*. These include **(atomic) vectors**, **matrices**, **data frames** and **lists**.
![](img/data_structures.png)
### (Atomic) Vectors
The **atomic vector** (or **vector** for short) is the simplest data structure in R which consists of an ordered set of values of the same type and or class (e.g. numeric, character, date, etc...). A vector can be created using the *combine* function `c()` as in
```{r}
x <- c(674 , 4186 , 5308 , 5083 , 6140 , 6381)
x
```
A vector object is an indexable collection of values which allows one to access a specific index number. For example, to access the third element of `x`, type:
```{r}
x[3]
```
You can also select a subset of elements by index values using the combine function `c()`.
```{r}
x[c(1,3,4)]
```
Or, if you are interested in a range of indexed values such as index 2 through 4, use the `:` operator.
```{r}
x[2:4]
```
You can also assign new values to a specific index. For example, we can replace the second value in vector `x` with `0`.
```{r}
x[2] <- 0
x
```
Note that a vector can store any data type such as characters.
```{r}
x <- c("all", "b", "olive")
x
```
However, a vector can only be of one type. For example, you cannot mix numeric and character types as follows:
```{r}
x <- c( 1.2, 5, "Rt", "2000")
```
<br>
<div style="background-color: #FAD4AA; border-width:1px; border-style: solid; border-color: #CCCC11; border-radius: 5px; padding: 20px;">
<br>
When data of different types are combined in a vector or in an operation, R will convert the element types to the **highest common mode** following the order: <p>
<center> **NULL < logical < integer < double < character**. </center><p>
</div>
<br>
In our working example, the elements are coerced to `character`:
```{r}
typeof(x)
```
### Matrices and arrays
Matrices in R can be thought of as vectors indexed using *two* indices instead of one. For example, the following line of code creates a 3 by 3 matrix of randomly generated values. The parameters `nrow` and `ncol` define the matrix dimension and the function `runif()` generates the nine random numbers that populate this matrix.
```{r}
m <- matrix(runif(9,0,10), nrow = 3, ncol = 3)
m
```
If a higher dimension vector is desired, then use the `array()` function to generate the n-dimensional object. A 3x3x3 array can be created as follows:
```{r}
m <- array(runif(27,0,10), c(3,3,3))
m
```
Matrices and arrays can store numeric or character data types, but they cannot store both. This is not to say that you can't have a matrix of the kind
```{r echo=FALSE}
matrix(c("a","b",2,4), nrow=2)
```
but the value `2` and `4` are no longer treated as numeric values but as character values instead.
### Data frames
A **data frame** is what comes closest to our perception of a traditional data *table*. Unlike a matrix, a data frame can *mix* data types across columns (e.g. both numeric and character columns can coexist in a data frame) but data type remains the same down each column.
```{r}
name <- c("a1", "a2", "b3")
value1 <- c(23, 4, 12)
value2 <- c(1, 45, 5)
dat <- data.frame(name, value1, value2)
dat
```
To view each column's data type use the structure `str` function.
```{r}
str(dat)
```
You'll notice that the `value1` and `value2` columns are stored as `numeric` (i.e. as doubles) and not as `integer`. There is some inconsistency in R's characterization of data type. Here, `numeric` represents double whereas an integer datatype would display `integer`. For example:
```{r}
value2 <- c(1L, 45L, 5L)
dat <- data.frame(name, value1, value2)
str(dat)
```
Like a vector, elements of a data frame can be accessed by their index (aka subscripts). The first index represents the row number and the second index represents the column number. For example, to list the second row of the third column, type:
```{r}
dat[2, 3]
```
If you wish to list all rows for columns one through two, leave the first index blank:
```{r}
dat[ , 1:2 ]
```
or, if you wish to list the third row for all columns, leave the second index blank:
```{r}
dat[ 3 , ]
```
You can also reference columns by their names if you append the `$` character to the dataframe object name. For example, to list the values in the column named `value2`, type:
```{r}
dat$value2
```
Dataframes also have attributes:
```{r}
attributes(dat)
```
It's class is `data.frame`. The `names` attribute lists the column names and can be extracted using the `names` function.
```{r}
names(dat)
```
The dataframe also has a `row.names` attribute. Since we did not explicitly define row names, R simply assigned the row number as row names. You can extract the row names using the `rownames` function.
```{r}
rownames(dat)
```
Finally, to get the dimensions of a dataframe (or a matrix), use the `dim()` function.
```{r}
dim(dat)
```
The first value returned by the function represents the number of rows (`r dim(dat)[1]` rows), the second value returned by the function represents the number of columns (`r dim(dat)[2]` columns).
### Lists
A **list** is an ordered set of components stored in a 1D vector. In fact, it's another kind of vector called a **recursive vector** where each vector element can be of different data type and *structure*. This implies that each element of a list can hold complex objects such as matrices, data frames and other list objects too! Think of a list as a single column spreadsheet where each cell stores anything from a number, to a three paragraph sentence, to a five column table.
A list is constructed using the `list()` function. For example, the following list consists of 3 components: a two-column data frame (tagged as component `A`), a two element logical vector (tagged as component `B`) and a three element character vector (tagged as component `D`).
```{r}
A <- data.frame(
x = c(7.3, 29.4, 29.4, 2.9, 12.3, 7.5, 36.0, 4.8, 18.8, 4.2),
y = c(5.2, 26.6, 31.2, 2.2, 13.8, 7.8, 35.2, 8.6, 20.3, 1.1) )
B <- c(TRUE, FALSE)
D <- c("apples", "oranges", "round")
lst <- list(A = A, B = B, D = D)
```
You can view each component's structure using the `str()` function.
```{r}
str(lst)
```
Each component of a list can be extracted using the `$` symbol followed by that component's name. For example, to access component `A` from list `lst`, type:
```{r}
lst$A
```
You can also access that same component using its *numerical index*. Since `A` is the first component in `lst`, its numerical index is `1`.
```{r}
lst[[1]]
```
Note that we are using **double brackets** to extract `A`. In doing so, we are extracting `A` in its *native data format* (a data frame in this example). Had we used single brackets, `A` would have been extracted as a single component *list* regardless of its native format. The following compares the different data structure outputs between single and double bracketed indices:
```{r}
class(lst[[1]])
class(lst[1])
```
To list the names for each component in a list use the `names()` function:
```{r}
names(lst)
```
Note that components do not require names. For example, we could have created a list as follows (note the omission of `A=`, `B=`, etc...):
```{r}
lst.notags <- list(A, B, D)
```
Listing its contents displays bracketed indices instead of component names.
```{r}
lst.notags
```
When lists do not have component names, the `names()` function will return `NULL`.
```{r}
names(lst.notags)
```
It's usually good practice to assign names to components as these can provide meaningful descriptions of each component.
You'll find that many functions in R return list objects such as the linear regression model function `lm`. For example, run a regression analysis for vector elements `x` and `y` (in data frame `A`) and save the output of the regression analysis to an object called `M`:
```{r}
M <- lm( y ~ x, A)
```
Now let's verify `M`'s data structure. The following shows just the first few lines of the output.
```{r eval=FALSE}
str(M)
```
```{r echo=FALSE, results='hold'}
str(M, list.len = 1)
cat("...")
```
The contents of the regression model object `M` consists of `r length(M)` components which include various diagnostic statistics, regression coefficients, and residuals. Let's extract each component's name:
```{r}
names(M)
```
Fortunately, the regression function assigns descriptive names to each of its components making it easier to figure out what most of these components represent. For example, it's clear that the `residuals` component stores the residual values from the regression model.
```{r}
M$residuals
```
The `M` list is more complex than the simple list `lst` we created earlier. In addition to having more components, it stores a *wider range* of data types and structures. For example, element `qr` is itself a list!
```{r}
str(M$qr)
```
So, if we want to access the element `rank` in the component `qr` of list `M`, we can type:
```{r}
M$qr$rank
```
If we want to access `rank` using indices instead, and noting that `qr` is the 7^th^ component in list `M` and that `rank` is the 5th element in list `qr` we type:
```{r}
M[[7]][[5]]
```
This should illustrate the value in assigning names to list components; not only do the double brackets clutter the expression, but finding the element numbers can be daunting in a complex list structure.
## Coercing data from one type to another
Data can be coerced from one type to another. For example, to coerce the following vector object from character to numeric, use the `as.numeric()` function.
```{r}
y <- c("23.8", "6", "100.01","6")
y.c <- as.numeric(y)
y.c
```
The `as.numeric` function forces the vector to a double (you could have also used the `as.double` function). If you convert `y` to an integer, R will remove all fractional parts of the number.
```{r}
as.integer(y)
```
To convert a number to a character use `as.character()`.
```{r}
numchar <- as.character(y.c)
numchar
```
You can also coerce a number or character to a factor.
```{r}
numfac <- as.factor(y)
numfac
charfac <- as.factor(y.c)
charfac
```
There are many other coercion functions in R, a summary of some the most common ones we'll be using in this course follows:
+-----------------------------+-------------------------+
|as.character() | Convert to character |
+-----------------------------+-------------------------+
|as.numeric() or as.double() | Convert to double |
+-----------------------------+-------------------------+
|as.integer() | Convert to integer |
+-----------------------------+-------------------------+
|as.factor() | Convert to factor |
+-----------------------------+-------------------------+
|as.logical() | Convert to a Boolean |
+-----------------------------+-------------------------+
### A word of caution when converting from factors
If you need to coerce a factor whose levels represent characters to a `character` data type, use the `as.character()` function.
```{r}
char <- as.character(charfac)
class(char)
```
Numbers stored as factors can also be coerced back to numbers, but be careful, the following will not produce the expected output:
```{r}
num <- as.numeric(numfac)
num
```
The output does not look like a numeric representation of the original elements in `y`. Instead it lists the integers that *point* to the set of unique **levels** (see the earlier section on [factors](#factors_xtr)). To see the *unique* levels in `numfac`, type:
```{r}
levels(numfac)
```
There are three unique values in our vector. Recall that the order in which the levels appear in the above output represents the ordered *pointer* values. So `1` points to level `100.01`, `2` points to level `23.8`, and `3` points to level `6`.
So, to extract the actual values (and not the pointers), you must first convert the factor to character before converting to a numeric vector.
```{r}
num <- as.numeric(as.character(numfac))
num
```
## Adding metadata to objects
Earlier, you were introduced to object attributes. Attributes can be thought of as ancillary information attached to a data object's header. In other words, it is not part of the set of values stored in the object. If one or more attributes is/are present in the data object, they can be revealed using the `attributes` command. We already explored the attributes associated with a `factor` object. For example. `a.fact` has two attributes: `levels` and `class`.
```{r}
attributes(a.fact)
```
If an attribute is not explicitly defined for a data object, then a `NULL` is returned.
```{r}
attributes(a)
```
You can create your own attribute name, but it's best to avoid attribute names commonly used in R such as `class`, `dim`, `dimnames`, `names`, `row.names`.
An object's attribute is usually a part of a data object that a user does not need to explictly deal with, but it can be leveraged to store metadata (data documentation). R has a reserved attribute name, `comment`, that can be used for this purpose. You can add this attribute using the `attr` function.
```{r}
attr(a, "comment") <- "a batch of character values"
```
Note that R has a special function for the comment attribute which can be used to extract the comment attribute value.
```{r}
comment(a)
```
You can also use the `comment` function to set the comment value.
```{r}
comment(a) <- "comment added using the comment function"
comment(a)
```
You can, of course, create your own attribute. For example, to create an attribute named `dim.units` and assigning it the value `"meters"` to the object `y.c` created earlier in this tutorial, type:
```{r}
attr(y.c, "dim.units") <- "meters"
```
Note that many R packages will also implement their own set of attribute names. So you should familiarize yourself with the packages used in a worklfow before defining your own attribute name.