<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta name="viewport" content="width=device-width, initial-scale=1">
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css">
<meta name=viewport content='width=800'>
<meta name="generator" content="HTML Tidy for Linux/x86 (vers 11 February 2007), see www.w3.org">
<style type="text/css">
/* Color scheme stolen from Sergey Karayev */
a {
color: #1772d0;
text-decoration:none;
}
a:focus, a:hover {
color: #f09228;
text-decoration:none;
}
body,td,th {
font-family: 'Lato', Verdana, Helvetica, sans-serif;
font-size: 12px
}
strong {
font-family: 'Lato', Verdana, Helvetica, sans-serif;
font-size: 12px;
}
heading {
font-family: 'Lato', Verdana, Helvetica, sans-serif;
font-size: 22px;
}
papertitle {
font-family: 'Lato', Verdana, Helvetica, sans-serif;
font-size: 13px;
font-weight: 700
}
name {
font-family: 'Lato', Verdana, Helvetica, sans-serif;
font-size: 40px;
}
.fade {
transition: opacity .2s ease-in-out;
-moz-transition: opacity .2s ease-in-out;
-webkit-transition: opacity .2s ease-in-out;
}
img {
display: inline;
margin: 0 auto;
width: 100%;
}
.image-cropper {
width: 250px;
height: 270px;
position: relative;
overflow: hidden;
border-radius: 50%;
}
.fa {
padding: 12px;
font-size: 21px;
width: 21px;
text-align: center;
text-decoration: none;
margin: 5px 2px;
border-radius: 50%;
}
.fa:hover {
opacity: 0.7;
}
.fa-facebook {
background: #3B5998;
color: white;
}
.fa-twitter {
background: #55ACEE;
color: white;
}
.fa-google {
background: #dd4b39;
color: white;
}
.fa-linkedin {
background: #007bb5;
color: white;
}
.fa-instagram {
background: #125688;
color: white;
}
.fa-skype {
background: #00aff0;
color: white;
}
</style>
<link rel="icon" type="image" href="img/logo.jpg">
<title>Abhay Kumar</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<link href='http://fonts.googleapis.com/css?family=Lato:400,700,400italic,700italic' rel='stylesheet' type='text/css'>
</head>
<body>
<table width="800" border="0" align="center" cellspacing="0" cellpadding="0">
<tr>
<td>
<table width="100%" align="center" border="0" cellspacing="0" cellpadding="20">
<tr>
<td width="67%" valign="middle">
<div id="top">
<p align="center">
<name>Abhay Kumar</name> <br><br> </p>
</div>
</font>
<p style="text-align:justify"></a> I graduated with an M.S. in <a href="https://www.cs.wisc.edu/"> Computer Sciences</a> from the <a href="https://www.wisc.edu/"> University of Wisconsin-Madison </a> <a href="https://www.cs.wisc.edu/">(UWM/UWisc)</a>. My research interests lie primarily in Deep Learning and its applications to Computer Vision, Natural Language Processing, and recommendation systems. Over the years, I have built a strong academic background, relevant work experience, and research aptitude <b>(12+ publications, 230+ citations, h-index: 6, i10-index: 4)</b>.
<br><br>
Previously, I worked at <a href="http://www.samsung.com/in/aboutsamsung/samsungelectronics/india/rnd/">Samsung R&D Institute India, Bangalore</a> on <a href=" https://www.samsung.com/us/explore/bixby/"> Bixby</a>, an Artificial Intelligence based smart assistant, leveraging deep learning technologies. I was awarded the <a href="img/SCA.jpg"> <b> Samsung Citizen Award</b> </a> in the Technology Excellence category for outstanding contributions in 2017-18. For my contributions to research, publications, and patents, I was again awarded the <a href="img/SCA2.jpg"> <b> Samsung Citizen Award </b> </a> in the "Innovator" category in 2019.
<br><br>
Prior to that, I completed my undergraduate studies at the <a href="https://www.iitk.ac.in/">Indian Institute of Technology Kanpur </a> <a href="https://www.iitk.ac.in/"> (IIT Kanpur)</a> with a major in Electrical Engineering. I received the <b>Academic Excellence Award</b> for outstanding academic performance in two consecutive years, <a href="img/AEA_1.jpg">2013-14</a> and <a href="img/AEA_2.jpg">2014-15</a>.
<br><br>
Research Interests: Multimodal (Image, Video, Speech) Signal Processing, Computer Vision, NLP, Machine Learning, Deep Learning & Optimization
<br><br> </p>
<p align=center>
<a target="_blank" href="https://scholar.google.co.in/citations?hl=en&user=hMTQZDQAAAAJ">Google Scholar</a>  / 
<a target="_blank" href="https://www.linkedin.com/in/abhaykumar3/">LinkedIn</a>  / 
<a target="_blank" href="mailto:abhay.kumar@wisc.edu">Wisc email</a>  / 
<a target="_blank" href="mailto:abykumar12011@gmail.com">Gmail</a>  / 
<a href="#bottom">Social media links</a>
<!--  /  <a href="files/TODO.pdf">Resume</a> -->
</p>
</td>
<td width="33%"><img class="image-cropper" src="img/abhay2.jpg"></td>
</tr>
</tr>
<br>
</table>
<table width="100%" align="center" border="1" cellspacing="0" cellpadding="20" bgcolor="#FFFFE0">
<tr>
<td> <font size="+0.5">
Citations according to <a href="https://scholar.google.co.in/citations?hl=en&user=hMTQZDQAAAAJ"> Google Scholar: </a> <b>(12+ publications, 230+ citations, h-index: 6, i10-index: 4)</b>.</font>
<!-- <iframe src="https://pages.cs.wisc.edu/~abhayk/citations.php?id=hMTQZDQAAAAJ&lang=en" name="meiniframe" border="0" allowtransparency="true" width="100%" height="200" frameborder="0"></iframe> -->
<!-- <iframe src="https://secret-dusk-94803.herokuapp.com/?id=hMTQZDQAAAAJ&lang=en" name="meiniframe" border="0" allowtransparency="true" width="100%" height="200" frameborder="0"></iframe> -->
</td>
</tr>
</table>
<!-- UPDATES -->
<table width="100%" align="center" border="0" cellspacing="0" cellpadding="20">
<tr>
<td>
<heading style="font-size:22px">Updates</heading>
</td>
</tr><tr>
<td> <ul style="list-style-type:circle;">
<li> <font color='red'>[Not updated since 2021] </font> </li>
<li>[June 2021] Reviewer for <a href="https://2021.emnlp.org/"> The 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP-2021) </a> </li>
<li>[Mar 2021] Member of the Scientific Review Committee for <a href="https://www.miccai2020.org/en/"> 24th International Conference On Medical Image Computing & Computer Assisted Intervention (MICCAI-2021). </a> </li>
<li>[Mar 2020] Member of the Scientific Review Committee for <a href="https://www.miccai2020.org/en/"> 23rd International Conference On Medical Image Computing & Computer Assisted Intervention (MICCAI-2020). </a> </li>
<li>[Dec 2019] Received a Research Assistantship at the University of Wisconsin-Madison. </li>
<li>[Aug 2019] Received a Teaching Assistantship at the University of Wisconsin-Madison. </li>
<li>[Apr 2019] Also received admission offers from UCSD, UMass, USC, and UMD.</li>
<li>[Feb 2019] Accepted the admission offer from the University of Wisconsin-Madison. </li>
</ul>
</td>
</tr>
</table>
<!-- COMPETITIONS -->
<table width="100%" align="center" border="0" cellspacing="0" cellpadding="20">
<tr>
<td>
<heading style="font-size:22px">Competitions</heading>
</td>
</tr>
<tr >
<td width="30%"><img id="img-opt" src="img/ms_logo.png" alt="project_img" width="160" style="border-style: none">
</td>
<td valign="top" width="70%">
<p><a href="https://competitions.codalab.org/competitions/20616#learn_the_details">
<papertitle>Microsoft AI Challenge India 2018</papertitle></a><br>
<em>Phase-1 Rank: 2nd | Phase-2 Rank : 6th (Over 2000 teams participated)</em><br>
<p style="text-align:justify"> Problem Statement: "Given a user query and candidate passages corresponding to each, the task is to mark the most relevant passage which contains the answer to the user query. As search engines evolve to respond to speech inputs and as usage of ambient devices like speakers grow in the society etc. returning 10 blue links to a search query is not always desirable. At Bing.com, our aim is to serve answer to questions directly without users having to search through the 10 blue links."
<br><br>
<a href="img/maic_leaderboard.png">leaderboard (22/01/2018) </a> | <a href="img/maic_cert.jpg">certificate</a> | <a href="https://competitions.codalab.org/competitions/20616#learn_the_details">problem statement</a> | <a href="https://competitions.codalab.org/competitions/20616#results">live leaderboard</a>
</a> </p>
</td>
</tr>
<tr>
<td>
<heading style="font-size:22px">Publications</heading>
</td>
</tr>
<tr >
<td width="30%"><img id="img-opt" src="img/paper_2.png" alt="project_img" width="160" style="border-style: none">
</td>
<td valign="top" width="70%">
<p><a href="https://www.isca-speech.org/archive/Interspeech_2018/pdfs/1811.pdf">
<papertitle>Speech Emotion Recognition Using Spectrogram & Phoneme Embedding</papertitle></a><br>
<em>INTERSPEECH 2018 </em><br>
<p style="text-align:justify">This paper proposes a speech emotion recognition method based on phoneme sequences and spectrograms. Both retain the emotional content of speech, which is lost when the speech is converted into text. We performed various experiments with different kinds of deep neural networks with phonemes and spectrograms as inputs. Three of those network architectures are presented here; they achieved better accuracy than the state-of-the-art methods on a benchmark dataset. A combined phoneme and spectrogram CNN model proved to be the most accurate in recognizing emotions on IEMOCAP data. We achieved a more than 4% increase in overall accuracy and average class accuracy compared to the existing state-of-the-art methods.
<br><br>
<a href="https://www.isca-speech.org/archive/Interspeech_2018/pdfs/1811.pdf">paper link</a> | <a href="files/paper2.pdf">pdf</a>
</a> </p>
</td>
</tr>
<tr >
<td width="30%"><img id="img-opt" src="img/ijcai.png" alt="project_img" width="160" style="border-style: none">
</td>
<td valign="top" width="70%">
<p><a href="https://arxiv.org/abs/1906.08873">
<papertitle>Learning Discriminative features using Center Loss and Reconstruction as Regularizer for Speech Emotion Recognition</papertitle></a><br>
<em> In IJCAI Workshop on Artificial Intelligence in Affective Computing [ACCEPTED] </em> <br>
<p style="text-align:justify">This paper proposes a Convolutional Neural Network (CNN) inspired by Multitask Learning (MTL) and based on speech features trained under the joint supervision of softmax loss and center loss, a powerful metric learning strategy, for the recognition of emotion in speech. Speech features such as Spectrograms and Mel-frequency Cepstral Coefficients (MFCCs) help retain emotion-related low-level characteristics in speech. We experimented with several Deep Neural Network (DNN) architectures that take in speech features as input and trained them under both softmax and center loss, which resulted in highly discriminative features ideal for Speech Emotion Recognition (SER). Our networks also employ a regularizing effect by simultaneously performing the auxiliary task of reconstructing the input speech features. This sharing of representations among related tasks enables our network to better generalize the original task of SER. Some of our proposed networks contain far fewer parameters when compared to state-of-the-art architectures.
<br><br>
<a href="https://arxiv.org/abs/1906.08873">paper link </a>
</a> </p>
</td>
</tr>
<tr >
<td width="30%"><img id="img-opt" src="img/paper1.png" alt="project_img" width="160" style="border-style: none">
</td>
<td valign="top" width="70%">
<p><a href="https://ieeexplore.ieee.org/document/7249385">
<papertitle>Hybrid Maximum Depth-kNN Method for Real-Time Node Tracking using Multi-Sensor Data</papertitle></a><br>
<em> IEEE International Conference on Communications (ICC) 2015, London, UK </em> <br>
<p style="text-align:justify">In this paper, a hybrid MD-kNN method for real-time sensor node tracking is proposed. The method combines two individual location hypothesis functions obtained from generalized maximum depth and generalized kNN methods. The individual location hypothesis functions are themselves obtained from multiple sensors measuring visible light, humidity, temperature, acoustics, and link quality. The hybrid MD-kNN method therefore combines the lower computational power of maximum depth and the outlier rejection ability of the kNN method to realize a robust real-time localization method. Additionally, this method does not require the assumption of an underlying distribution under non-line-of-sight (NLOS) conditions. An additional novelty of this method is the utilization of multivariate data obtained from multiple sensors, which has hitherto not been used. The affine invariance property of the hybrid MD-kNN method is proved and its robustness is illustrated in the context of node localization. Experimental results on the Intel Berkeley research data set indicate reasonable improvements over conventional methods available in the literature.
<br><br>
<a href="https://ieeexplore.ieee.org/document/7249385">paper link </a> | <a href="files/paper1.pdf">pdf</a> | <a href="files/paper1_presentation.pdf">presentation</a>
</a> </p>
</td>
</tr>
<tr >
<td width="30%"><img id="img-opt" src="img/paper10.png" alt="project_img" width="160" style="border-style: none">
</td>
<td valign="top" width="70%">
<p><a href="https://ieeexplore.ieee.org/document/9004020">
<papertitle>Emoception: An Inception Inspired Efficient Speech Emotion Recognition Network</papertitle></a><br>
<em>2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Singapore </em><br>
<p style="text-align:justify"> This research proposes a Deep Neural Network architecture for Speech Emotion Recognition called Emoception, which takes inspiration from Inception modules. The network takes speech features like Mel-Frequency Spectral Coefficients (MFSC) or Mel-Frequency Cepstral Coefficients (MFCC) as input and recognizes the relevant emotion in the speech. We use the USC-IEMOCAP dataset for training, but the limited amount of training data and the large depth of the network make the network prone to overfitting, reducing validation accuracy. The Emoception network overcomes this problem by extending in width without an increase in computational cost. We also employ a powerful regularization technique, Multi-Task Learning (MTL), to make the network robust. The model using MFSC input with MTL increases the accuracy by 1.6% vis-à-vis Emoception without MTL. We report an overall accuracy improvement of around 4.6% compared to the existing state-of-the-art methods for four emotion classes on the IEMOCAP dataset.
<br><br>
<a href="https://ieeexplore.ieee.org/document/9004020">paper link</a> | <a href="files/paper10_asru.pdf">pdf</a>
</a> </p>
</td>
</tr>
<tr >
<td width="30%"><img id="img-opt" src="img/paper7.png" alt="project_img" width="160" style="border-style: none">
</td>
<td valign="top" width="70%">
<p><a href="https://link.springer.com/chapter/10.1007/978-3-030-23281-8_5">
<papertitle>Bidirectional Transformer based Multi-Task Learning for Natural Language Understanding</papertitle></a><br>
<em>24th International Conference on Applications of Natural Language to Information Systems, Salford, United Kingdom </em><br>
<p style="text-align:justify"> We propose a multi-task learning-based framework for natural language understanding tasks like sentiment and topic classification. We make use of a bidirectional transformer-based architecture to generate encoded representations from the given input, followed by task-specific layers for classification. The Multi-Task Learning (MTL) based framework makes use of a different set of tasks in parallel, as a kind of additional regularization, to improve the generalizability of the trained model over individual tasks. We introduced a task-specific auxiliary problem using the k-means clustering algorithm to be trained in parallel with the main tasks to reduce the model’s generalization error on the main task. POS-tagging was also used as one of the auxiliary tasks. We also trained on multiple benchmark classification datasets in parallel to improve the effectiveness of our bidirectional transformer-based network across all the datasets. Our proposed MTL based transformer network improved state-of-the-art overall accuracy on the Movie Review (MR), AG News, and Stanford Sentiment Treebank (SST-2) corpora by 6%, 1.4%, and 3.3% respectively.
<br><br>
<a href="https://link.springer.com/chapter/10.1007/978-3-030-23281-8_5">paper link</a>
</a> </p>
</td>
</tr>
<tr >
<td width="30%"><img id="img-opt" src="img/paper8.png" alt="project_img" width="160" style="border-style: none">
</td>
<td valign="top" width="70%">
<p><a href="https://link.springer.com/chapter/10.1007/978-3-030-23281-8_7">
<papertitle>Deceptive Reviews Detection using Deep Learning Techniques</papertitle></a><br>
<em>24th International Conference on Applications of Natural Language to Information Systems, Salford, United Kingdom </em><br>
<p style="text-align:justify"> With the increasing influence of online reviews in shaping customer decision-making and purchasing behavior, many unscrupulous businesses have a vested interest in generating and posting deceptive reviews. Deceptive reviews are fictitious reviews written deliberately to sound authentic and deceive consumers. Traditional deceptive review detection methods are based on various handcrafted features, including linguistic and psychological ones, which characterize deceptive reviews. However, the proposed deep learning methods have better self-adaptability to extract the desired features implicitly and outperform all traditional methods. We have proposed multiple Deep Neural Network (DNN) based approaches for deceptive review detection and have compared the performance of these models on multiple benchmark datasets. Additionally, we have identified a common problem of handling the variable lengths of these reviews. We have proposed two different methods, Multi-Instance Learning and a Hierarchical architecture, to handle variable-length review texts. In experiments on multiple benchmark datasets of deceptive reviews, our models outperformed the existing state of the art. We also evaluated the proposed method on other review-related tasks, such as review sentiment detection, and achieved state-of-the-art accuracies on two benchmark datasets.
<br><br>
<a href="https://link.springer.com/chapter/10.1007/978-3-030-23281-8_7">paper link</a>
</a> </p>
</td>
</tr>
<tr >
<td width="30%"><img id="img-opt" src="img/paper9.png" alt="project_img" width="160" style="border-style: none">
</td>
<td valign="top" width="70%">
<p><a href="https://arxiv.org/abs/1908.08652">
<papertitle> MTCNet: Multi-Task Learning Paradigm for Crowd Count Estimation </papertitle></a><br>
<em>16th IEEE International Conference on Advanced Video and Signal-based Surveillance (AVSS), Taipei, Taiwan <b> [ACCEPTED] </b> <b> [update: WITHDRAWN] </b></em><br>
<p style="text-align:justify"> We propose a Multi-Task Learning (MTL) paradigm based deep neural network architecture, called MTCNet (Multi-Task Crowd Network), for crowd density and count estimation. Crowd count estimation is challenging due to the non-uniform scale variations and the arbitrary perspective of an individual image. The proposed model has two related tasks, with Crowd Density Estimation as the main task and Crowd-Count Group Classification as the auxiliary task. The auxiliary task helps in capturing the relevant scale-related information to improve the performance of the main task. The main task model comprises two blocks: a VGG-16 front-end for feature extraction and a dilated Convolutional Neural Network for density map generation. The auxiliary task model shares the same front-end as the main task, followed by a CNN classifier. Our proposed network achieves 5.8% and 14.9% lower Mean Absolute Error (MAE) than the state-of-the-art methods on the ShanghaiTech dataset without using any data augmentation. Our model also achieves 10.5% lower MAE on the UCF_CC_50 dataset.
<br><br>
<a href="https://arxiv.org/abs/1908.08652">paper link</a>
</td>
</tr>
<tr >
<td width="30%"><img id="img-opt" src="img/paper3.png" alt="project_img" width="160" style="border-style: none">
</td>
<td valign="top" width="70%">
<p><a href="https://ieeexplore.ieee.org/document/8987153">
<papertitle>Exploiting SIFT Descriptor for Rotation Invariant Convolutional Neural Network</papertitle></a><br>
<em>15th IEEE India Council International Conference (INDICON 2018) </em><br>
<p style="text-align:justify"> This paper presents a novel approach to exploit distinctive invariant features in convolutional neural networks. The proposed CNN model uses a Scale Invariant Feature Transform (SIFT) descriptor instead of the max-pooling layer. The max-pooling layer discards the pose, i.e., the translational and rotational relationship between the low-level features, and is hence unable to capture the spatial hierarchies between low- and high-level features. The SIFT descriptor layer captures the orientation and the spatial relationship of the features extracted by the convolutional layer. The proposed SIFT Descriptor CNN therefore combines the feature extraction capabilities of the CNN model and the rotation invariance of the SIFT descriptor. Experimental results on the MNIST and fashionMNIST datasets indicate reasonable improvements over conventional methods available in the literature.
<br><br>
<a href="https://arxiv.org/abs/1904.00197">arXiv link</a> | <a href="https://ieeexplore.ieee.org/document/8987153"> IEEE link </a> |
<a href="files/paper3_ppt.pdf">presentation</a> | <a href="img/indicon_cert.jpg">certificate</a>
</a> </p>
</td>
</tr>
<tr >
<td width="30%"><img id="img-opt" src="img/paper4.png" alt="project_img" width="160" style="border-style: none">
</td>
<td valign="top" width="70%">
<p><a href="https://arxiv.org/pdf/1906.05682">
<papertitle>Focal Loss based Residual Convolutional Neural Network for Speech Emotion Recognition</papertitle></a><br>
<em>20th International Conference on Computational Linguistics and Intelligent Text Processing, La Rochelle, France <b> [IN PRESS] </b></em><br>
<p style="text-align:justify"> This paper proposes a Residual Convolutional Neural Network (ResNet) based on speech features and trained under Focal Loss to recognize emotion in speech. Speech features such as Spectrogram and Mel-frequency Cepstral Coefficients (MFCCs) have shown the ability to characterize emotion better than just plain text. Further, Focal Loss, first used in One-Stage Object Detectors, has shown the ability to focus the training process more towards hard examples and down-weight the loss assigned to well-classified examples, thus preventing the model from being overwhelmed by easily classifiable examples. After experimenting with several Deep Neural Network (DNN) architectures, we propose a ResNet that takes in Spectrogram or MFCC as input and is supervised by Focal Loss, which is ideal for speech inputs where there exists a large class imbalance. Maintaining continuity with previous work in this area, we have used the University of Southern California’s Interactive Emotional Motion Capture (USC-IEMOCAP) database’s Improvised Topics in this work. This dataset is ideal for our work, as there exists a significant class imbalance among the various emotions. Our best model achieved a 3.4% improvement in overall accuracy and a 2.8% improvement in class accuracy when compared to existing state-of-the-art methods.
<br><br>
<a href="https://arxiv.org/pdf/1906.05682">paper link</a> |
<a href="files/paper4_ppt_308_short.pdf">presentation</a> | <a href="files/paper4_308-poster.pdf">poster</a>
</a> </p>
</td>
</tr>
<tr >
<td width="30%"><img id="img-opt" src="img/paper5.png" alt="project_img" width="160" style="border-style: none">
</td>
<td valign="top" width="70%">
<p><a href="https://arxiv.org/pdf/1906.04914">
<papertitle>From Fully Supervised to Zero Shot Settings for Twitter Hashtag Recommendation</papertitle></a><br>
<em>20th International Conference on Computational Linguistics and Intelligent Text Processing, La Rochelle, France <b> [IN PRESS] </b></em><br>
<p style="text-align:justify"> We propose a comprehensive end-to-end pipeline for a Twitter hashtag recommendation system, including data collection, a supervised training setting, and a zero shot training setting. In the supervised training setting, we have proposed and compared the performance of various deep learning architectures, namely Convolutional Neural Network (CNN), Recurrent Neural Network (RNN) and Transformer Network. However, it is not feasible to collect data for all possible hashtag labels and train a classifier model on them. To overcome this limitation, we propose a Zero Shot Learning (ZSL) paradigm for predicting unseen hashtag labels by learning the relationship between the semantic space of tweets and the embedding space of hashtag labels. We evaluated various state-of-the-art ZSL methods like Convex combination of Semantic Embedding (ConSE), Embarrassingly Simple Zero Shot Learning (ESZSL) and Deep Embedding Model for Zero Shot Learning (DEM-ZSL) for the hashtag recommendation task. We demonstrate the effectiveness and scalability of ZSL methods for the recommendation of unseen hashtags. To the best of our knowledge, this is the first quantitative evaluation of ZSL methods to date for recommending unseen hashtags from tweet text.
<br><br>
<a href="https://arxiv.org/pdf/1906.04914">paper link</a> |
<a href="files/paper5_ppt_325_short.pdf">presentation</a> | <a href="img/cicling_certificate_1.jpg">certificate</a> | <a href="files/paper5_325-poster.pdf">poster</a>
</a> </p>
</td>
</tr>
<tr >
<td width="30%"><img id="img-opt" src="img/paper6.png" alt="project_img" width="160" style="border-style: none">
</td>
<td valign="top" width="70%">
<p><a href="https://arxiv.org/pdf/1906.05681">
<papertitle>Deep Learning Based Emotion Recognition System Using Speech Features and Transcriptions</papertitle></a><br>
<em>20th International Conference on Computational Linguistics and Intelligent Text Processing, La Rochelle, France <b> [IN PRESS] </b></em><br>
<p style="text-align:justify"> This paper proposes a speech emotion recognition method based on speech features and speech transcriptions (text). Speech features such as Spectrogram and Mel-frequency Cepstral Coefficients (MFCC) help retain emotion-related low-level characteristics in speech whereas text helps capture semantic meaning, both of which help in different aspects of emotion detection. We experimented with several Deep Neural Network (DNN) architectures, which take in different combinations of speech features and text as inputs. The proposed network architectures achieve higher accuracies when compared to state-of-the-art methods on a benchmark dataset. The combined MFCC-Text Convolutional Neural Network (CNN) model proved to be the most accurate in recognizing emotions in IEMOCAP data. We achieved an almost 7% increase in overall accuracy as well as an improvement of 5.6% in average class accuracy when compared to existing state-of-the-art methods.
<br><br>
<a href="https://arxiv.org/pdf/1906.05681">paper link</a> |
<a href="files/paper6_ppt_307_short.pdf">presentation</a> | <a href="img/cicling_certificate_2.jpg">certificate</a> | <a href="files/paper6_307-poster.pdf">poster</a>
</a> </p>
</td>
</tr>
<tr >
<td width="30%"><img id="img-opt" src="img/cacnn.png" alt="project_img" width="160" style="border-style: none">
</td>
<td valign="top" width="70%">
<p><a href="https://arxiv.org/abs/1906.09986">
<papertitle>Visual Context-aware Convolution Filters for Transformation-invariant Neural Network</papertitle></a><br>
<em> </b></em>
<p style="text-align:justify"> We propose a novel visual context-aware filter generation module which incorporates contextual information present in images into Convolutional Neural Networks (CNNs). In contrast to traditional CNNs, we do not employ the same set of learned convolution filters for all input image instances. Our proposed input-conditioned convolution filters, when combined with techniques inspired by Multi-instance learning and max-pooling, result in a transformation-invariant neural network. We investigated the performance of our proposed framework on three MNIST variations, which cover both rotation and scaling variance, and achieved 1.13% error on MNIST-rot-12k, 1.12% error on Half-rotated MNIST and 0.68% error on Scaling MNIST, which is significantly better than the state-of-the-art results. We make use of visualization to further demonstrate the effectiveness of our visual context-aware convolution filters. Our proposed visual context-aware convolution filter generation framework can also serve as a plugin for any CNN based architecture and enhance its modeling capacity.
<br><br>
<a href="https://arxiv.org/abs/1906.09986">paper link</a>
</a> </p>
</td>
</tr>
<tr>
<td>
<heading style="font-size:22px">Undergraduate Research Projects</heading>
</td>
</tr>
<tr >
<td width="30%"><img id="img-opt" src="img/cs771.png" alt="project_img" width="160" style="border-style: none">
</td>
<td valign="top" width="70%">
<p><a href="files/cs771_report.pdf">
<papertitle>Classification of objects from the stream of Surveillance videos</papertitle></a><br>
<em>Supervisor: <a href="https://www.iitk.ac.in/new/dr-harish-karnick">Dr Harish Karnick</a>, Indian Institute of Technology Kanpur</em><br>
<p style="text-align:justify">The project aimed at building a system for detecting and classifying objects in a video stream into three classes: Pedestrian, Two-Wheeler and Four-Wheeler.
<br><br>
• Implemented various state-of-the-art Background Subtraction algorithms for detecting objects by performing a connected component analysis. <br>
• Extracted SIFT features from different image patches obtained using the bounding boxes of the annotated frames. <br>
• Trained different classifiers, namely SVM, random forest and decision tree, to predict the label of detected objects.<br>
• Extracted features from images using convolutional neural networks (CNNs) pre-trained on the ILSVRC 2012 dataset, using Caffe for feature extraction.<br>
• For detection, implemented Selective Search, which generates all possible object locations in a given image. It is a data-driven approach which combines the strengths of segmentation and exhaustive search.
<br><br>
<a href="files/cs771_report.pdf">report</a>
</a> </p>
</td>
</tr>
<tr >
<td width="30%"><img id="img-opt" src="img/ee698m.png" alt="project_img" width="160" style="border-style: none">
</td>
<td valign="top" width="70%">
<p><a href="files/EE698M_project_report.pdf">
<papertitle>Direct Content Analysis for Scene Intensity Estimation in Movies using low-level multimodal features</papertitle></a><br>
<em>Supervisor: <a href="http://home.iitk.ac.in/~tanaya/Home.html">Dr. Tanaya Guha</a>, Indian Institute of Technology Kanpur</em><br>
<p style="text-align:justify">The project aimed at developing a computational model to estimate the scene intensity profile in movies or videos. Scene intensity can be understood as a measure of excitement or activity in a scene.
<br><br>
• Exploited computable video features, namely average shot length, color variance, motion content, lighting key, motion energy, harmonicity, etc., to compute scene intensity.
<br>
• Incorporated facial emotion detection using the optical flow of facial interest points.
<br>
• Created a small dataset by manually timestamping scene boundaries and conducted a survey asking people how critical they considered these scenes to be in a particular movie.
<br>
• Various cinematic principles and video features are being exploited for robust scene intensity estimation.
<br><br>
Selected as the <b>best project</b> in a course of around 30 students.
<br><br>
<a href="files/EE698M_project_report.pdf">report</a> | <a href="files/EE698M_Presentation.pdf">presentation</a> | <a href="files/EE698M_Paper Presentation.pdf">paper presentation</a>
</a> </p>
</td>
</tr>
<tr >
<td width="30%"><img id="img-opt" src="img/ee629_convex.png" alt="project_img" width="160" style="border-style: none">
</td>
<td valign="top" width="70%">
<p><a href="files/EE609A_Convex.pdf">
<papertitle>Dictionary Learning and Sparse Representation-based Image Processing Applications</papertitle></a><br>
<em>Supervisor: <a href = "http://home.iitk.ac.in/~ketan/">Dr. Ketan Rajawat</a>, Indian Institute of Technology Kanpur</em><br>
<p style="text-align:justify">The project aimed at exploring various dictionary learning algorithms (k-SVD, MOD, OMP) and implementing sparse-representation-based applications in image processing, such as image denoising, inpainting, classification, and compression.
<br><br>
• Implemented image inpainting (removing corrupted pixels in the target region) using sparse representation on a dictionary learned from randomly sampled patches of the source region of the image. <br>
• Compared sparse-representation-based image denoising using an overcomplete DCT dictionary with state-of-the-art methods. <br>
• Implemented sparse-representation-based image classification on the MNIST dataset.
<br><br>
<a href="files/EE609A_Convex.pdf">report</a>
</a> </p>
</td>
</tr>
<tr >
<td width="30%"><img id="img-opt" src="img/ee627.png" alt="project_img" width="160" style="border-style: none">
</td>
<td valign="top" width="70%">
<p><a href="files/EE627A_TP.pdf">
<papertitle>Age and Gender Recognition of a Speaker from Interactive Voice Response (IVR) Systems</papertitle></a><br>
<em>Supervisor: <a href="http://home.iitk.ac.in/~rhegde/">Dr R.M.Hegde</a>, Indian Institute of Technology Kanpur</em><br>
<p style="text-align:justify">The project aimed at building a system for age and gender recognition using speech features.
<br><br>
• Pre-processed speech signals and extracted useful long-term and short-term features, including MFCC (Mel Frequency Cepstral Coefficients), Shifted Delta Cepstral (SDC) coefficients, pitch, and the first three formants.
<br>
• Trained a 128-mixture GMM with MAP adaptation on MFCCs and used WSNMF for dimensionality reduction.
<br>
• Analysed the performance of various machine learning classifiers, such as Support Vector Machines (SVM), Random Forests, and Decision Trees.
<br><br>
<a href="files/EE627A_TP.pdf">report</a> | <a href="files/ee627_presentation.pdf">presentation</a>
</a> </p>
</td>
</tr>
<tr >
<td width="30%"><img id="img-opt" src="img/ee604.png" alt="project_img" width="160" style="border-style: none">
</td>
<td valign="top" width="70%">
<p><a href="files/EE604_tp.pdf">
<papertitle>Analysis of Benford's Law in Digital Image Forensics</papertitle></a><br>
<em>Supervisor: <a href="http://www.iitk.ac.in/ee/people/fac-pages/sumana.shtml">Dr Sumana Gupta </a>, Indian Institute of Technology Kanpur</em><br>
<p style="text-align:justify"> Analysed various applications of Benford's law in Digital Image Forensics.
<br><br>
• JPEG and JPEG2000 images approximately follow this law, with JPEG2000 found to follow it more closely. Deviation from the law in different forensic setups can be used as a fingerprint. Further, the amount of deviation from Benford's law in compressed images can be used to detect forgery.
<br>
• Performed simulations to detect multiple compressions of JPEG images, glare in UCID images, etc.
<br><br>
<a href="files/EE604_tp.pdf">report</a>
</a> </p>
</td>
</tr>
<tr >
<td width="30%"><img id="img-opt" src="img/data_depth.png" alt="project_img" width="160" style="border-style: none">
</td>
<td valign="top" width="70%">
<p><a href="files/dd_Report.pdf">
<papertitle>Non-Parametric Method for Indoor Fire-Fighter Tracking using Data Depth-based Localizers </papertitle></a><br>
<em>Supervisor: <a href="http://home.iitk.ac.in/~rhegde/">Dr R.M.Hegde </a>, Indian Institute of Technology Kanpur</em><br>
<p style="text-align:justify"> The project aimed at developing Maximum depth-based, kNN depth-based, and Hybrid MD-kNN Localizers for ad-hoc Sensor Networks. These localizers offer improvements in computation time and robustness.
<br><br>
• Implemented depth-based localizers in MATLAB and R for various depth functions such as Tukey, Liu, Oja, L1, and Mahalanobis depth. Analysed the time complexity and robustness of these depth functions.
<br>
• Performed training at each of the grid points using the offline collected observation vectors from all the anchors in the network and mapped the online observation vector to the appropriate grid point in the network using localizers.
<br>
• Analysed Intel Lab Data for assessing robustness, localization success rate, and computation time of the localizers.
<br><br>
<a href="files/dd_Report.pdf">report</a>
</a> </p>
</td>
</tr>
<table width="100%" align="center" border="0" cellspacing="0" cellpadding="20">
<tr>
<td>
<heading style="font-size:22px">Internships</heading>
</td>
</tr>
<tr >
<td width="30%"><img id="img-pgp" src="img/samsung.jpeg" alt="project_img" width="160" style="border-style: none">
</td>
<td valign="top" width="70%">
<p><a href="http://www.samsung.com/in/aboutsamsung/samsungelectronics/india/rnd/">
<papertitle>Samsung Research and Development Institute, Bengaluru (SRIB) [May'15-June'15] </papertitle></a><br>
<em>Supervisor: Srinivas Rao Kudavelly, Principal Engineer, Innovation &amp; Enterprise Biz Division / HME (Health and Medical Equipment) / Ultrasound </em> <br>
<p style="text-align:justify">Pyramidal Implementation of Lucas-Kanade-Tomasi (LKT) Feature Tracking Algorithm for 3D Images
<br><br>
• The project aimed at a C++ implementation of the Lucas-Kanade-Tomasi (LKT) feature tracking algorithm for 3D ultrasound images (echocardiograms). The pyramidal implementation of the algorithm improves local accuracy and robustness. <br>
<!-- • Tracked simple geometrical objects and its affine-transformed version with very high accuracy. <br>
• Used the algorithm for tracking real world scenarios (ultrasound 3D image volume) with reasonably high accuracy. <br> -->
• Analysed the sensitivity of the algorithm to various parameters. <br> <br>
Received a <b>Pre-Placement Offer</b> for a full-time position at Samsung.
</p>
</td>
</tr>
<tr>
<div id="bottom">
<table width="80%" align="center" border="0" cellspacing="0" cellpadding="20">
<tr>
<td>
<br>
<p align="center">
<a href="mailto:abykumar12011@gmail.com" class="fa fa-google"></a>
<a href="https://www.linkedin.com/in/abhaykumar3/" class="fa fa-linkedin"></a>
<a href="skype:abykumar12011?userinfo" class="fa fa-skype"></a>
<a href="https://www.facebook.com/abhay.lost" class="fa fa-facebook"></a>
<a href="https://www.instagram.com/abhay.lost/" class="fa fa-instagram"></a>
<a href="#" class="fa fa-twitter"></a>
</p>
<p align="center"><font size="3">
<a href="#top">Go to top</a>
</font>
</p>
<p align="right"><font size="0.5">
<a href="http://www.cs.berkeley.edu/~barron/">inspired by this website</a>
</font>
</p>
</td>
</tr>
</table>
</div>
</td>
</tr>
</table>
<!-- Default Statcounter code for Personal Webpage https://abhayk1201.github.io
-->
<script type="text/javascript">
var sc_project=11871076;
var sc_invisible=1;
var sc_security="c4b776df";
var sc_https=1;
var sc_remove_link=1;
</script>
<script type="text/javascript"
src="https://www.statcounter.com/counter/counter.js" async></script>
<noscript><div class="statcounter"><img class="statcounter"
src="//c.statcounter.com/11871076/0/c4b776df/1/" alt="Web
Analytics"></div></noscript>
<!-- End of Statcounter Code -->
</body>
</html>