-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathisp_eda.qmd
398 lines (269 loc) · 13.4 KB
/
isp_eda.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
---
title: "EDA on Internet Service Providers"
date: last-modified
format:
html:
code-fold: true
engine: knitr
---
```{r}
#| label: utility functions
source("R/table_with_options.R")
```
We are starting our first exploratory data analysis around ISPs in the FCC NBM data set. It should be kept in mind that an ISP can be listed multiple times at the same location (offering multiple service).
Our goal is being able to take FCC data and
1. correctly identify the single ISP that is providing each reported service (i.e., deduplication of ISP names)
2. correctly identify that ISP over time (from the same program and from other FCC products).
How should we define an ISP? How can we define coverage (should a service 0/0 be considered as part of the extent of an ISP's coverage)?
We shifted a bit from exploring to trying to classify the quality of information we have from FCC about ISP.
The query that generated the data set is here:
```SQL
select
frn,
provider_id,
brand_name,
count(distinct location_id) as cnt_locations,
count(*) as cnt_locations_services,
bool_or(case when technology = 10 then true else false end) as has_copperwire,
bool_or(case when technology = 40 then true else false end) as has_coaxial_cable,
bool_or(case when technology = 50 then true else false end) as has_fiber,
bool_or(case when technology >= 70 then true else false end) as has_wireless,
bool_or(case when technology = 60 or technology = 61 then true else false end) as has_satel,
array_agg(distinct state_abbr)
from staging.june23
group by frn, provider_id, brand_name
```
The name of the columns match the FCC's documented [descriptions](https://us-fcc.app.box.com/v/bdc-data-downloads-output).
We are adding:
- `cnt_locations_services`: count of **services**, in one location you can have multiple services with different providers, technology and speeds provides (sometimes one providers can have multiple technology and/or multiple speeds)
- `cnt_locations`: count of locations covered by this specific set of brand_name, provider_id, state_abbr and technology (here if a provider declare providing different speed in that location it will not be counted)
- a serie of flag (`has`) telling if this "combo" is proving said technology
- an array listing in which states are present our "combo"
We have from FCC:
:::{.aside}
Source: [https://us-fcc.app.box.com/v/bdc-data-downloads-output](https://us-fcc.app.box.com/v/bdc-data-downloads-output) page 4
:::
- `frn` **F**CC **R**egistration **N**umber; "number of the entity that submited the data". It is supposed to be a string of 10 characters (with padding 0).
- `provider_id`: "unique identifier for the fixed service provider"
- `brand_name`: "Name of the entity or service advertised or offered to consumers."
::: {.callout-tip}
Every row is matching a combination of **unique** FRN, Provider ID and brand name.
:::
```{r}
#| label: read ISP
get_me_isp <- function(path) {
isp <- read.csv(path,
colClasses = c(frn ="character",
provider_id = "character",
brand_name = "character",
cnt_locations = "character",
cnt_locations_services = "character",
has_copperwire = "logical",
has_coaxial_cable = "logical",
has_fiber = "logical",
has_wireless = "logical",
has_satel = "logical",
array_agg = "character"))
isp[["cnt_locations"]] <- as.numeric(isp[["cnt_locations"]])
isp[["cnt_locations_services"]] <- as.numeric(isp[["cnt_locations_services"]])
return(isp)
}
isp <- get_me_isp("data/isp.csv")
```
```{r}
#| label: Display ISP
#| tbl-cap: "Raw ISP split by FRN, Provider ID and brand names"
table_with_options(isp)
```
## Numbers for context:
### Raw numbers out of the box:
Number of unique `frn`: `r length(unique(isp[["frn"]]))`
Number of unique `provider_id`: `r length(unique(isp[["provider_id"]]))`
Number of unique brand name pre cleaning: `r length(unique(isp[["brand_name"]]))`
```{r}
count_and_clean <- function(vec) {
length(unique(tolower(trimws(gsub("_", " ", vec)))))}
num_brand_name <- count_and_clean(isp[["brand_name"]])
```
Removing all capitalization and change underscore for white space help lower the number of unique brand names to: `r num_brand_name`
```{r}
isp[["clean_name"]] <- tolower(trimws(gsub("_", " ", isp[["brand_name"]])))
```
### Potential sources of errors:
- `FRN` can be wrong or not meaningfull
- `provider_id` can be wrong
- brand names can be different or evolve over time.
One case:
| frn | provider_id| brand_name| cnt_locations | cnt_locations_services | has_copperwire | has_coaxial_cable | has_fiber | has_wireless | has_satel | array_agg |
|---- | ---- | --- | --- | --- | ---- | ----| --- | --- | ----| ----| ---- | --- |
|0003738655 | 130432 | "EATEL Corp." | 83537 | 86548 | true | true | true | false | false | {LA} |
| 0009873712 | 131103 | "EATEL Corp." | 34494 | 34497 | false | true | true | false | false | {LA} |
: First example different FRN and Provider ID
Other case:
| frn | provider_id| brand_name| cnt_locations | cnt_locations_services | has_copperwire | has_coaxial_cable | has_fiber | has_wireless | has_satel | array_agg |
|---- | ---- | --- | --- | --- | ---- | ----| --- | --- | ----| ----| ---- | --- |
| 0002626984 |130008 | Acentek | 47 | 47 | true| false | true | false | false | {MN} |
| 0002626984 | 130008 | ACENTEK | 1395 | 1395 | true | false | false | false | false | {MN} |
| 0002645927 | 130008 | Acentek | 19521 | 26636 | true |false |true| true | false | {IA,MN} |
: Acentek here have the same provider ID but different FRN
### Rules for problems:
#### Less than 10 locations per rows:
```{r}
#| label: less than 10 locations
#| tbl-cap: "Row with less than 10 locations"
isp[["few_locations"]] <- NA_integer_
for (i in 1:10) {
isp[isp[["cnt_locations"]] == i, "few_locations"] <- i
}
#feel bad relying on table removing NA
few_loc <- as.data.frame(table(isp$few_locations))
names(few_loc) <- c("Number of locations", "Number of cases")
knitr::kable(few_loc)
```
Potential solutions:
- We can decide to not keep those rows
- Merge them with either other rows that is matching `provider_id` or `frn` (when this is an option)
#### More than one frn for a provider_id
```{r}
#| label: more than one frn for a provider
temp <- sapply(split(isp$frn, isp$provider_id), function(x) length(unique(x)))
dta <- data.frame(provider_id = names(temp), frn_by_provider_id = temp)
# correct lower/upper case / replace underscore by " " / some white space on both side
# triming whispace did not improve for this dataset but I should keept doing it
dta[["unique_brand_name_by_provider_id"]] <- sapply(split(isp$brand_name, isp$provider_id),
count_and_clean)
dta[["same_state_by_provider_id"]] <- sapply(split(isp$array_agg, isp$provider_id), function(x) length(unique(x)))
isp <- merge(isp, dta, by.x = "provider_id", by.y = "provider_id",
all.x = TRUE, all.y = TRUE)
more_frn_than_provider <- subset(isp, isp$frn_by_provider_id > 1)
table_with_options(more_frn_than_provider)
```
Unique provider_id + brand_name are kind of "green" (for one time frame):
Number of green isp: `r nrow(isp[isp$unique_brand_name_by_provider_id == 1,])`
We can have one `provider_id` with multiple `frn` and same or not `brand_name` (see TSC for example / 150266)
It seems:
- Windstream has 37 different `frn`: we can maybe test if it has windstream in it's name ..
- Otelco/GoNet (18 cases)
- Rally Networks/Oregon Telephone Company, is their frn wrongs ? (17 cases)
- Same Provider for differents `frn` and `brand_name` in Minesota (MN) (16 cases)
- `160127` I do not see any kind of specific pattern for this one
- `131486` seems to be RiverStreet Networks with various `frn` (13 cases) -> will be catch by unique_brand_name_by_provider_id
- `190233` multiple brand name and frn but seems to be in Texas and Oklahoma (13)
- `131226` seems to be Fastwyre Broadband divided by technology and states (12 cases) -> will be catch by unique_brand_name_by_provider_id
- `130804` seems to be Mediacom (+ Bolt) with different states and names indicating their states (11 cases)
- Google Fiber (`240041`) seems to be have `frn` split by states (with a weird 'Webpass, Inc.') (11 cases) -> will be catch by unique_brand_name_by_provider_id (except Webpass, Inc which is weird, technology is 70 that I should correct)
- AT&T Inc (`130077`) multiple frn (10 cases) -> filter by unique_brand_name_by_provider_id
- 130079 = Astound_Broadband (10 cases) -> will be catch by unique_brand_name_by_provider_id
- Verizon -> filter by unique_brand_name_by_provider_id
- long ling (130757 ) & co are problematics (multiples names / one provider / 3 states )
- `130906` is also hard to fix
Assuming that same name (clean version) + same `provider_id` provide us with a unique ISP, it helps move to greensih:
```{r}
table_with_options(more_frn_than_provider[more_frn_than_provider$unique_brand_name_by_provider_id == 1,])
```
This is removing 194 out of 1170.
#### Same provider_id and same states
Not too sure about this one.
#### More than one provider_id for a cleaned brand name
This is the case for "EATEL".
```{r}
temp <- sapply(split(isp$provider_id, isp$clean_name),
function(x) length(unique(x)))
dta <- data.frame(clean_name = names(temp), provider_id_by_clean_name = temp)
isp <- merge(isp, dta, by.x = "clean_name", by.y = "clean_name",
all.x = TRUE, all.y = TRUE)
provider_id_by_clean_name <- subset(isp, isp$provider_id_by_clean_name > 1)
table_with_options(provider_id_by_clean_name)
```
This could also move 63 cases in the greenish spot. -> nop
## What are our options/next steps:
- Make a column "ready to go"
```{r}
# if it has unique brand name and frn by provider id id should be good
temp <- isp$frn_by_provider_id + isp$unique_brand_name_by_provider_id
isp[["rdy_to_go"]] <- ifelse(temp == 2, "green", "not green")
# cases where we have a unique frn and provider id but not unique brand name
# default of olive is that we need to pick a name out of more than one
temp <- ifelse(isp$frn_by_provider_id == 1, 1, 0) + ifelse(isp$unique_brand_name_by_provider_id > 1, 1, 0)
isp[temp == 2, "rdy_to_go"] <- "olive"
# the few locations should be "red" and maybe dropped later
isp[which(isp$few_locations == "few locations"), "rdy_to_go"] <- "red"
table_with_options(isp)
```
A good example could be `131167` and how we can discriminate Orbitel communications. We can also prob. raise the bar of "few locations".
A quick summary of where we are:
```{r}
table(isp[["rdy_to_go"]])
```
- Make an id <---> provider_id / frn / brand_name table
# Typology of ISP
The data was generated from June 23 FCC release and assumed that an FRN = ISP.
Can we guess who is a small ISP?
```{r}
# con <- cori.db::connect_to_db("proj_calix")
# bob <- DBI::dbReadTable(con, "frn_desc")
# DBI::dbDisconnect(con)
# bob[["is_calix"]] <- NULL
# write.csv(bob, "data/frn_desc.csv")
frn_desc <- read.csv("data/frn_desc.csv")
#classInt::classIntervals(cnt_locations, n = 20, style = "jenks")
# not that great
table_with_options(frn_desc)
```
```{r}
cnt_locations <- frn_desc[["cnt_locations"]]
summary(cnt_locations)
```
```{r}
#| column: margin
boxplot(cnt_locations, horizontal = TRUE, col = 2, border = 2, frame = F, main = "Count of locations per ISP")
```
Some ISP are declaraing covering a huge number of locations. Some low counts are probably errors.
Count of FRN with a less than 10 locations: `r nrow(frn_desc[frn_desc$cnt_locations < 10,])`
Count of FRN with more than 500 000 locations: `r nrow(frn_desc[frn_desc$cnt_locations > 500000,])`
```{r}
frn_desc$n_states <- lengths(strsplit(gsub("\\{|\\}", "", frn_desc$states), ","))
```
If we filter them out (removing 110 cases):
::: {.panel-tabset}
## 100 000
```{r}
location_filter <- 100000
frn <- frn_desc[frn_desc$cnt_locations >= 10 & frn_desc$cnt_locations <= location_filter ,]
# dim(frn_desc)
# dim(frn)
hist(frn$cnt_locations, col = 2,
main = sprintf("Less than %s", location_filter), xlab = "count locations")
```
## 10 000
```{r}
location_filter <- 10000
frn <- frn_desc[frn_desc$cnt_locations >= 10 & frn_desc$cnt_locations <= location_filter ,]
# dim(frn_desc)
# dim(frn)
hist(frn$cnt_locations, col = 2,
main = sprintf("Less than %s", location_filter), xlab = "count locations")
```
:::
List of ISP that the Broadband team that are good reference of small provider:
| Name | FRN | Locations |
| --- | --- | --- |
| Newport Utilities, TN | 0027152438 |15383 |
| SandyNet, OR | 0004119376 | 4439|
| ECFiber, VT| 0027379676 | 22926 |
| Maple Broadband, VT| 0032366692 | 315|
| Uplink|0026218602| 1611|
| Black Bear Fiber|0025132648|1583|
| QCOL|0019663095|5610|
| Salsgiver|0011167079|29941|
| All Points Broadband|0023524705|107803|
| Marquette-Adams Telephone co-op |0003774023|130783 |
| USI fiber |0017096538|71466|
| Scott county telephone co | 0002069862|7829|
| PANGAEA |0016202236| 8410|
| Blue Mountain Networks |0005450507|310013|
Side notes:
- Newport Utilities = NUconnect
- SandyNet, OR = City of Sandy, OR
- USI FIber =
- Blue Mountain Networks = Blue Ridge Mountain Electric Membership Corporation